Whole cell engineering by mutagenizing a substantial portion of a starting genome, combining mutations, and optionally repeating

ABSTRACT

This invention relates to the field of cellular and whole organism engineering. Specifically, this invention relates to a cellular transformation, directed evolution, and screening method for creating novel transgenic organisms having desirable properties. Thus in one aspect, this invention relates to a method of generating a transgenic organism, such as a microbe or a plant, having a plurality of traits that are differentially activatable.

A—FIELD OF THE INVENTION

This invention relates to the field of cellular and whole organism engineering. Specifically, this invention relates to a cellular transformation, directed evolution, and screening method for creating novel transgenic organisms having desirable properties. Thus in one aspect, this invention relates to a method of generating a transgenic organism, such as a microbe or a plant, having a plurality of traits that are differentially activatable.

This invention also relates to the field of protein engineering. Specifically, this invention relates to a directed evolution method for preparing a polynucleotide encoding a polypeptide. More specifically, this invention relates to a method of using mutagenesis to generate a novel polynucleotide encoding a novel polypeptide, which novel polypeptide is itself an improved biological molecule and/or contributes to the generation of another improved biological molecule. More specifically still, this invention relates to a method of performing both non-stochastic polynucleotide chimerization and non-stochastic site-directed point mutagenesis.

Thus, in one aspect, this invention relates to a method of generating a progeny set of chimeric polynucleotide(s) by means that are synthetic and non-stochastic, and where the design of the progeny polynucleotide(s) is derived by analysis of a parental set of polynucleotides and/or of the polypeptides correspondingly encoded by the parental polynucleotides. In another aspect, this invention relates to a method of performing site-directed mutagenesis using means that are exhaustive, systematic, and non-stochastic.

Furthermore, this invention relates to a step of selecting from among a generated set of progeny molecules a subset comprised of particularly desirable species, including by a process termed end-selection, which subset may then be screened further. This invention also relates to the step of screening a set of polynucleotides for the production of a polypeptide and/or of another expressed biological molecule having a useful property.

Novel biological molecules whose manufacture is taught by this invention include genes, gene pathways, and any molecules whose expression is affected thereby, including directly encoded polypeptides and/or any molecules affected by such polypeptides. Said novel biological molecules include those that contain a carbohydrate, a lipid, a nucleic acid, and/or a protein component, and specific but non-limiting examples of these include antibiotics, antibodies, enzymes, and steroidal and non-steroidal hormones.

In a particular non-limiting aspect, the present invention relates to enzymes, particularly to thermostable enzymes, and to their generation by directed evolution. More particularly, the present invention relates to thermostable enzymes which are stable at high temperatures and which have improved activity at lower temperatures.

B—BACKGROUND General Overview of the Problem to be Solved

Brief Summary: It is instantly appreciated that the process of performing a genetic manipulation on an organism to achieve a genetic alteration, whether it is on a unicellular or on a multi-cellular organism, can lead to harmful, toxic, noxious, or even lethal effects on the manipulated organism. This is particularly true when the genetic manipulation becomes sizable. From a technical point of view, this problem is seen as one of the current obstacles that hinder the creation of genetically altered organisms having a large number of transgenic traits.

On the marketing side, it is instantly appreciated that the purchase price of a genetically altered organism is often dictated by, or proportional to, the number of transgenic traits that have been introduced into the organism. Consequently, a genetically altered organism having a large number of stacked transgenic traits can be quite costly to produce and purchase, and can be in low economic demand.

On the other hand, the generation of an organism having but a single genetically introduced trait can also lead to the incurrence of undesirable costs, although for other reasons. It is thus appreciated that the separate production, marketing, and storage of genetically altered organisms each having a single transgenic trait can incur costs, including inventory costs, that are undesirable. For example, the storage of such organisms may require a separate bin to be used for each trait. Furthermore, the value of an organism having a single particular trait is often intimately tied to the marketability of that particular trait, and when that marketability diminishes, inventories of such organisms cannot be sold in other markets.

The instant invention solves these and other problems by providing a method of producing genetically altered organisms having a large number of stacked traits that are differentially activatable. Upon purchasing such a genetically altered organism (having a large number of differentially activatable stacked traits), the purchasing customer has the option of selecting and paying for particular traits among the total that can then be activated differentially. One economic advantage provided by this invention is that the storage of such genetically altered organisms is simplified since, for example, one bin could be used to store a large number of traits. Moreover, a single organism of this type can satisfy the demands for a variety of traits; consequently, such an organism can be sold in a variety of markets.

To achieve the production of genetically altered organisms having a large number of stacked traits that are differentially activatable, this invention provides, in one specific aspect, a process comprising the step of monitoring a cell or organism at a holistic level. This serves as a way of collecting holistic, rather than isolated, information about a working cell or organism that is being subjected to a substantial amount of genetic manipulation. This invention further provides that this type of holistic monitoring can include the detection of all morphological, behavioral, and physical parameters.

Accordingly, the holistic monitoring provided by this invention can include the identification and/or quantification of all the genetic material contained in a working cell or organism (e.g. all nucleic acids including the entire genome, messenger RNAs, tRNAs, rRNAs, and mitochondrial nucleic acids, plasmids, phages, phagemids, viruses, as well as all episomal nucleic acids and endosymbiont nucleic acids). Furthermore, this invention provides that this type of holistic monitoring can include all gene products produced by the working cell or organism.

Furthermore, the holistic monitoring provided by this invention can include the identification and/or quantification of all molecules that are chemically at least in part protein in a working cell or organism. The holistic monitoring provided by this invention can also include the identification and/or quantification of all molecules that are chemically at least in part carbohydrate in a working cell or organism. The holistic monitoring provided by this invention can also include the identification and/or quantification of all molecules that are chemically at least in part proteoglycan in a working cell or organism. The holistic monitoring provided by this invention can also include the identification and/or quantification of all molecules that are chemically at least in part glycoprotein in a working cell or organism. The holistic monitoring provided by this invention can also include the identification and/or quantification of all molecules that are chemically at least in part nucleic acids in a working cell or organism. The holistic monitoring provided by this invention can also include the identification and/or quantification of all molecules that are chemically at least in part lipids in a working cell or organism.

In one aspect, this invention provides that the ability to differentially activate a trait from among many, such as an enzyme from among many enzymes, depends on the enzyme(s) to be activated having a unique activity profile (or activity fingerprint). An enzyme's activity profile includes the reaction(s) it catalyzes and its specificity. Thus, an enzyme's activity profile includes its:

-   Catalyzed reaction(s)
-   Reaction type
-   Natural substrate(s)
-   Substrate spectrum
-   Product spectrum
-   Inhibitor(s)
-   Cofactor(s)/prosthetic group(s)
-   Metal compounds/salts that affect it
-   Turnover number
-   Specific activity
-   Km value
-   pH optimum
-   pH range
-   Temperature optimum
-   Temperature range

It is also instantly appreciated that enzymes are differentially affected by exposure to varying degrees of processing (e.g. upon extraction and/or purification) and exposure (e.g. to suboptimal storage conditions). Accordingly, enzyme differences may surface after exposure to:

-   Isolation/Preparation
-   Purification
-   Crystallization
-   Renaturation

It is instantly appreciated that differences in molecular stability can also be used advantageously to differentially activate or inactivate selected enzymes, by exposing the enzymes for an appropriate time to variations in:

-   pH
-   Temperature
-   Oxidation
-   Organic solvent(s)
-   Miscellaneous storage conditions

It is thus appreciated that in order to be able to differentially activate selected traits among a plurality of stacked traits, it is desirable to introduce into a working cell or organism traits conferred by molecules (e.g. enzymes) having unique profiles (e.g. unique enzyme fingerprints). Furthermore, it is appreciated that in order to obtain molecules representing a wide range of molecular fingerprints, it is advantageous to harvest molecules from the widest possible reaches of nature's diversity. Thus, it is beneficial to harvest molecules not only from cultured mesophilic organisms, but also from extremophiles that are largely uncultured.
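The notion of differential activation through unique activity profiles can be illustrated with a minimal sketch. It assumes, purely for illustration, that each stacked trait is an enzyme characterized only by a pH range and a temperature range; the class `EnzymeProfile`, the function `active_traits`, the enzyme names, and all numeric values are hypothetical and are not taken from this disclosure.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class EnzymeProfile:
    """Hypothetical, highly simplified activity fingerprint of one enzyme."""
    name: str
    ph_range: Tuple[float, float]        # (min pH, max pH) where active
    temp_range_c: Tuple[float, float]    # (min, max) temperature in Celsius

    def is_active(self, ph: float, temp_c: float) -> bool:
        lo_ph, hi_ph = self.ph_range
        lo_t, hi_t = self.temp_range_c
        return lo_ph <= ph <= hi_ph and lo_t <= temp_c <= hi_t


def active_traits(profiles: List[EnzymeProfile], ph: float, temp_c: float) -> List[str]:
    """Return the names of the enzymes that are active under the given conditions."""
    return [p.name for p in profiles if p.is_active(ph, temp_c)]


# Illustrative values only (assumptions, not measured data).
STACKED = [
    EnzymeProfile("enzyme_A", (4.0, 6.0), (20.0, 40.0)),
    EnzymeProfile("enzyme_B", (7.0, 9.0), (20.0, 40.0)),
    EnzymeProfile("enzyme_C", (6.5, 8.0), (70.0, 95.0)),
]

print(active_traits(STACKED, ph=5.0, temp_c=30.0))   # ['enzyme_A']
print(active_traits(STACKED, ph=7.5, temp_c=80.0))   # ['enzyme_C']
```

Because the three hypothetical profiles do not overlap under the chosen conditions, each condition set activates exactly one trait. A real fingerprint would involve many more of the parameters listed above (substrates, inhibitors, cofactors, and so on).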

In another aspect, it is instantly appreciated that harvesting the full potential of nature's diversity can include both the step of discovery and the step of optimizing what is discovered. For example, the step of discovery allows one to mine biological molecules that have commercial utility. It is instantly appreciated that the ability to harvest the full richness of biodiversity, i.e. to mine biological molecules from a wide range of environmental conditions, is critical to the ability to discover novel molecules adapted to function under a wide variety of conditions, including extremes of conditions, such as may be found in a commercial application.

However, it is also instantly appreciated that only occasionally are there criteria for selection and/or survival in nature that point in the exact direction of particular commercial needs. Instead, it is often the case that a naturally occurring molecule will require a certain amount of change, from fine tuning to sweeping modification, in order to fulfill a particular unmet commercial need. Thus, to meet certain commercial needs (e.g., a need for a molecule that is functional under a specific set of commercial processing conditions) it is sometimes advantageous to experimentally modify a naturally expressed molecule to achieve properties beyond what natural evolution has provided and/or is likely to provide in the near future.

The approach, termed directed evolution, of experimentally modifying a biological molecule towards a desirable property, can be achieved by mutagenizing one or more parental molecular templates and by identifying any desirable molecules among the progeny molecules. Currently available technologies in directed evolution include methods for achieving stochastic (i.e. random) mutagenesis and methods for achieving non-stochastic (non-random) mutagenesis. However, critical shortfalls in both types of methods are identified in the instant disclosure.

In prelude, it is noteworthy that it may be argued philosophically by some that all mutagenesis, if considered from an objective point of view, is non-stochastic; and furthermore that the entire universe is undergoing a process that, if considered from an objective point of view, is non-stochastic. Whether this is true is outside of the scope of the instant consideration. Accordingly, as used herein, the terms “randomness”, “uncertainty”, and “unpredictability” have subjective meanings, and the knowledge, particularly the predictive knowledge, of the designer of an experimental process is a determinant of whether the process is stochastic or non-stochastic.

By way of illustration, stochastic or random mutagenesis is exemplified by a situation in which a progenitor molecular template is mutated (modified or changed) to yield a set of progeny molecules having mutation(s) that are not predetermined. Thus, in an in vitro stochastic mutagenesis reaction, for example, there is not a particular predetermined product whose production is intended; rather there is an uncertainty, hence randomness, regarding the exact nature of the mutations achieved, and thus also regarding the products generated. In contrast, non-stochastic or non-random mutagenesis is exemplified by a situation in which a progenitor molecular template is mutated (modified or changed) to yield a progeny molecule having one or more predetermined mutations. It is appreciated that the presence of background products in some quantity is a reality in many reactions where molecular processing occurs, and the presence of these background products does not detract from the non-stochastic nature of a mutagenesis process having a predetermined product.

Thus, as used herein, stochastic mutagenesis is manifested in processes such as error-prone PCR and stochastic shuffling, where the mutation(s) achieved are random or not predetermined. In contrast, as used herein, non-stochastic mutagenesis is manifested in instantly disclosed processes such as gene site-saturation mutagenesis and synthetic ligation reassembly, where the exact chemical structure(s) of the intended product(s) are predetermined.

In brief, existing mutagenesis methods that are non-stochastic have been serviceable in generating from one to only a very small number of predetermined mutations per method application, and thus produce per method application from one to only a few progeny molecules that have predetermined molecular structures. Moreover, the types of mutations currently available by the application of these non-stochastic methods are also limited, and thus so are the types of progeny mutant molecules.

In contrast, existing methods for mutagenesis that are stochastic in nature have been serviceable for generating somewhat larger numbers of mutations per method application, though in a random fashion and usually with a large but unavoidable contingent of undesirable background products. Thus, these existing stochastic methods can produce per method application larger numbers of progeny molecules, but ones that have undetermined molecular structures. The types of mutations that can be achieved by application of these current stochastic methods are also limited, and thus so are the types of progeny mutant molecules.

It is instantly appreciated that there is a need for the development of non-stochastic mutagenesis methods that:

-   1) Can be used to generate large numbers of progeny molecules that have predetermined molecular structures;
-   2) Can be used to readily generate more types of mutations;
-   3) Can produce a correspondingly larger variety of progeny mutant molecules;
-   4) Produce decreased unwanted background products;
-   5) Can be used in a manner that is exhaustive of all possibilities; and
-   6) Can produce progeny molecules in a systematic and non-repetitive way.

The instant invention satisfies all of these needs.

Directed Evolution Supplements Natural Evolution: Natural evolution has been a springboard for directed or experimental evolution, serving both as a reservoir of methods to be mimicked and of molecular templates to be mutagenized. It is appreciated that, despite its intrinsic process-related limitations (in the types of favored and/or allowed mutagenesis processes) and in its speed, natural evolution has had the advantage of having been in process for millions of years and throughout a wide diversity of environments. Accordingly, natural evolution (molecular mutagenesis and selection in nature) has resulted in the generation of a wealth of biological compounds that have shown usefulness in certain commercial applications.

However, it is instantly appreciated that many unmet commercial needs are discordant with any evolutionary pressure and/or direction that can be found in nature. Moreover, even when commercially useful mutations would otherwise be favored at the molecular level in nature, natural evolution often overrides the positive selection of such mutations, e.g. when there is a concurrent detriment to an organism as a whole (such as when a favorable mutation is accompanied by a detrimental mutation). Additionally, natural evolution is often slow, and favors fidelity in many types of replication. Additionally still, natural evolution often favors a path paved mainly by consecutive beneficial mutations while tending to avoid a plurality of successive negative mutations, even though such negative mutations may prove beneficial when combined, or may lead, through a circuitous route, to a final state that is beneficial.

Moreover, natural evolution advances through specific steps (e.g. specific mutagenesis and selection processes), with avoidance of less favored steps. For example, many nucleic acids do not reach close enough proximity to each other in an operative environment to undergo chimerization or incorporation or other types of transfers from one species to another. Thus, e.g., when sexual intercourse between 2 particular species is avoided in nature, the chimerization of nucleic acids from these 2 species is likewise unlikely, with parasites common to the two species serving as an example of a very slow passageway for inter-molecular encounters and exchanges of DNA. For another example, the generation of a molecule causing self-toxicity or self-lethality or sexual sterility is avoided in nature. For yet another example, a molecule having no particular immediate benefit to an organism is prone to vanish in subsequent generations of the organism. Furthermore, e.g., there is no selection pressure for improving the performance of a molecule under conditions other than those to which it is exposed in its endogenous environment; e.g. a cytoplasmic molecule is not likely to acquire functional features extending beyond what is required of it in the cytoplasm. Furthermore still, the propagation of a biological molecule is susceptible to any global detrimental effects, whether caused by itself or not, on its ecosystem. These and other characteristics greatly limit the types of mutations that can be propagated in nature.

On the other hand, directed (or experimental) evolution, particularly as provided herein, can be performed much more rapidly and can be directed in a more streamlined manner at evolving a predetermined molecular property that is commercially desirable where nature does not provide one and/or is not likely to provide one. Moreover, the directed evolution invention provided herein can provide more wide-ranging possibilities in the types of steps that can be used in mutagenesis and selection processes. Accordingly, using templates harvested from nature, the instant directed evolution invention provides more wide-ranging possibilities in the types of progeny molecules that can be generated, and in the speed at which they can be generated, than nature itself might often be expected to provide in the same length of time.

In a particular exemplification, the instantly disclosed directed evolution methods can be applied iteratively to produce a lineage of progeny molecules (e.g. comprising successive sets of progeny molecules) that would not likely be propagated (i.e., generated and/or selected for) in nature, but that could lead to the generation of a desirable downstream mutagenesis product that is not achievable by natural evolution.

Previous Directed Evolution Methods are Suboptimal:

Mutagenesis has been attempted in the past on many occasions, but by methods that are inadequate for the purpose of this invention. For example, previously described non-stochastic methods have been serviceable in the generation of only very small sets of progeny molecules (comprised often of merely a solitary progeny molecule). By way of illustration, a chimeric gene has been made by joining 2 polynucleotide fragments using compatible sticky ends generated by restriction enzyme(s), where each fragment is derived from a separate progenitor (or parental) molecule. Another example might be the mutagenesis of a single codon position (i.e. to achieve a codon substitution, addition, or deletion) in a parental polynucleotide to generate a single progeny polynucleotide encoding a single site-mutagenized polypeptide.

Previous non-stochastic approaches have only been serviceable in the generation of but one to a few mutations per method application. Thus, these previously described non-stochastic methods fail to address one of the central goals of this invention, namely the exhaustive and non-stochastic chimerization of nucleic acids. Accordingly, previous non-stochastic methods leave untapped the vast majority of the possible point mutations, chimerizations, and combinations thereof, which may lead to the generation of highly desirable progeny molecules.

In contrast, stochastic methods have been used to achieve larger numbers of point mutations and/or chimerizations than non-stochastic methods; for this reason, stochastic methods have comprised the predominant approach for generating a set of progeny molecules that can be subjected to screening, and amongst which a desirable molecular species might hopefully be found. However, a major drawback of these approaches is that, because of their stochastic nature, there is a randomness to the exact components in each set of progeny molecules that is produced. Accordingly, the experimentalist typically has little or no idea what exact progeny molecular species are represented in a particular reaction vessel prior to their generation. Thus, when a stochastic procedure is repeated (e.g. in a continuation of a search for a desirable progeny molecule), the re-generation and re-screening of previously discarded undesirable molecular species becomes a labor-intensive obstruction to progress, causing a circuitous, if not circular, path to be taken. The drawbacks of such a highly suboptimal path can be addressed by subjecting a stochastically generated set of progeny molecules to a labor-incurring process, such as sequencing, in order to identify their molecular structures, but even this is an incomplete remedy.

Moreover, current stochastic approaches are highly unsuitable for comprehensively or exhaustively generating all the molecular species within a particular grouping of mutations, for attributing functionality to specific structural groups in a template molecule (e.g. a specific single amino acid position or a sequence comprised of two or more amino acid positions), and for categorizing and comparing specific groupings of mutations. Accordingly, current stochastic approaches do not inherently enable the systematic elimination of unwanted mutagenesis results, and are, in sum, burdened by too many inherent shortcomings to be optimal for directed evolution.

In a non-limiting aspect, the instant invention addresses these problems by providing non-stochastic means for comprehensively and exhaustively generating all possible point mutations in a parental template. In another non-limiting aspect, the instant invention further provides means for exhaustively generating all possible chimerizations within a group of chimerizations. Thus, the aforementioned problems are solved by the instant invention.
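What "exhaustively generating all possible point mutations" means at the level of the encoded polypeptide can be illustrated with a short enumeration. The sketch below is an assumption-laden simplification: it operates on a one-letter amino acid string rather than on codons or oligonucleotides, it considers substitutions only (no additions or deletions), and the function name `single_site_variants` is hypothetical. It is not the laboratory procedure of this disclosure.

```python
from typing import Iterator

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def single_site_variants(parent: str) -> Iterator[str]:
    """Yield every sequence differing from `parent` by exactly one substitution.

    The order is systematic (position by position, residue by residue), so the
    same progeny set is produced on every run and no variant is repeated.
    """
    for residue in parent:
        if residue not in AMINO_ACIDS:
            raise ValueError(f"non-standard residue: {residue!r}")
    for pos, wild_type in enumerate(parent):
        for aa in AMINO_ACIDS:
            if aa != wild_type:
                yield parent[:pos] + aa + parent[pos + 1:]


variants = list(single_site_variants("MKT"))
print(len(variants))          # 57, i.e. 3 positions x 19 substitutions
print(variants[:3])           # ['AKT', 'CKT', 'DKT']
```

Because the enumeration is deterministic, the composition of the progeny set is known before any screening takes place, which is the contrast with stochastic methods drawn above.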

Specific shortfalls in the technological landscape addressed by this invention include:

-   1) Site-directed mutagenesis technologies, such as sloppy or low-fidelity PCR, are ineffective for systematically achieving at each position (site) along a polypeptide sequence the full (saturated) range of possible mutations (i.e. all possible amino acid substitutions).
-   2) There is no relatively easy systematic means for rapidly analyzing the large amount of information that can be contained in a molecular sequence and in the potentially colossal number of progeny molecules that could be conceivably obtained by the directed evolution of one or more molecular templates.
-   3) There is no relatively easy systematic means for providing comprehensive empirical information relating structure to function for molecular positions.
-   4) There is no easy systematic means for incorporating internal controls, such as positive controls, for key steps in certain mutagenesis (e.g. chimerization) procedures.
-   5) There is no easy systematic means to select for a specific group of progeny molecules, such as full-length chimeras, from among smaller partial sequences.

An exceedingly large number of possibilities exist for the purposeful and random combination of amino acids within a protein to produce useful hybrid proteins and the corresponding biological molecules, i.e., DNA and RNA, that encode these hybrid proteins. Accordingly, there is a need to produce and screen a wide variety of such hybrid proteins for a desirable utility, particularly widely varying random proteins.

The complexity of an active sequence of a biological macromolecule (e.g., polynucleotides, polypeptides, and molecules that are comprised of both polynucleotide and polypeptide sequences) has been called its information content (“IC”), which has been defined as the resistance of the active protein to amino acid sequence variation (calculated from the minimum number of invariable amino acids (bits) required to describe a family of related sequences with the same function). Proteins that are more sensitive to random mutagenesis have a high information content.

Molecular biology developments, such as molecular libraries, have allowed the identification of quite a large number of variable bases, and even provide ways to select functional sequences from random libraries. In such libraries, most residues can be varied (although typically not all at the same time) depending on compensating changes in the context. Thus, while a 100 amino acid protein can contain only 2,000 different single mutations, 20¹⁰⁰ sequence combinations are possible.
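The arithmetic behind these two figures can be checked directly. In the sketch below, the function names are hypothetical. The 2,000 figure corresponds to 100 positions times 20 amino acids, which counts the wild-type residue at each position; counting only substitutions that actually change a residue gives 100 × 19 = 1,900.

```python
def single_site_choices(length: int, alphabet_size: int = 20) -> int:
    """Residue choices over all positions, wild-type included (the 2,000 figure)."""
    return length * alphabet_size


def single_substitutions(length: int, alphabet_size: int = 20) -> int:
    """Distinct single substitutions that differ from the parent sequence."""
    return length * (alphabet_size - 1)


def full_sequence_space(length: int, alphabet_size: int = 20) -> int:
    """Number of possible sequences of the given length (20**100 for n = 100)."""
    return alphabet_size ** length


n = 100
print(single_site_choices(n))               # 2000
print(single_substitutions(n))              # 1900
print(len(str(full_sequence_space(n))))     # 131 (a 131-digit number, about 1.27e130)
```

The gap between roughly two thousand single-site changes and a 131-digit number of full combinations is what makes exhaustive single-site coverage feasible while exhaustive coverage of all combinations is not.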

Information density is the IC per unit length of a sequence. Active sites of enzymes tend to have a high information density. By contrast, flexible linkers in enzymes have a low information density.

Current methods in widespread use for creating alternative proteins in a library format are error-prone polymerase chain reactions and cassette mutagenesis, in which the specific region to be optimized is replaced with a synthetically mutagenized oligonucleotide. In both cases, a substantial number of mutant sites are generated around certain sites in the original sequence.

Error-prone PCR uses low-fidelity polymerization conditions to introduce a low level of point mutations randomly over a long sequence. In a mixture of fragments of unknown sequence, error-prone PCR can be used to mutagenize the mixture. The published error-prone PCR protocols suffer from a low processivity of the polymerase. Therefore, the protocol is unable to result in the random mutagenesis of an average-sized gene. This inability limits the practical application of error-prone PCR. Some computer simulations have suggested that point mutagenesis alone may often be too gradual to allow the large-scale block changes that are required for continued and dramatic sequence evolution. Further, the published error-prone PCR protocols do not allow for amplification of DNA fragments greater than 0.5 to 1.0 kb, limiting their practical application. In addition, repeated cycles of error-prone PCR can lead to an accumulation of neutral mutations with undesired results, such as affecting a protein's immunogenicity but not its binding affinity.

In oligonucleotide-directed mutagenesis, a short sequence is replaced with a synthetically mutagenized oligonucleotide. This approach does not generate combinations of distant mutations and is thus not combinatorial. The limited library size relative to the vast sequence length means that many rounds of selection are unavoidable for protein optimization. Mutagenesis with synthetic oligonucleotides requires sequencing of individual clones after each selection round followed by grouping them into families, arbitrarily choosing a single family, and reducing it to a consensus motif. Such a motif is re-synthesized and reinserted into a single gene, followed by additional selection. This stepwise process constitutes a statistical bottleneck, is labor intensive, and is not practical for many rounds of mutagenesis.

Error-prone PCR and oligonucleotide-directed mutagenesis are thus useful for single cycles of sequence fine-tuning, but rapidly become too limiting when they are applied for multiple cycles.

Another limitation of error-prone PCR is that the rate of down-mutations grows with the information content of the sequence. As the information content, library size, and mutagenesis rate increase, the balance of down-mutations to up-mutations will statistically prevent the selection of further improvements (statistical ceiling).

In cassette mutagenesis, a sequence block of a single template is typically replaced by a (partially) randomized sequence. Therefore, the maximum information content that can be obtained is statistically limited by the number of random sequences (i.e., library size). This eliminates other sequence families which are not currently best, but which may have greater long term potential.

Also, mutagenesis with synthetic oligonucleotides requires sequencing of individual clones after each selection round. Thus, such an approach is tedious and impractical for many rounds of mutagenesis.

Thus, error-prone PCR and cassette mutagenesis are best suited, and have been widely used, for fine-tuning areas of comparatively low information content. One apparent exception is the selection of an RNA ligase ribozyme from a random library using many rounds of amplification by error-prone PCR and selection.

In nature, the evolution of most organisms occurs by natural selection and sexual reproduction. Sexual reproduction ensures mixing and combining of the genes in the offspring of the selected individuals. During meiosis, homologous chromosomes from the parents line up with one another and cross over part way along their length, thus randomly swapping genetic material. Such swapping or shuffling of the DNA allows organisms to evolve more rapidly.

In recombination, because the inserted sequences were of proven utility in a homologous environment, the inserted sequences are likely to still have substantial information content once they are inserted into the new sequence.

Theoretically there are 2,000 different single mutants of a 100 amino acid protein. However, a protein of 100 amino acids has 20¹⁰⁰ possible sequence combinations, a number which is too large to exhaustively explore by conventional methods. It would be advantageous to develop a system which would allow generation and screening of all of these possible combination mutations.

Some workers in the art have utilized an in vivo site specific recombination system to generate hybrids that combine light chain antibody genes with heavy chain antibody genes for expression in a phage system. However, their system relies on specific sites of recombination and is limited accordingly. Simultaneous mutagenesis of antibody CDR regions in single chain antibodies (scFv) by overlapping extension and PCR has been reported.

Others have described a method for generating a large population of multiple hybrids using random in vivo recombination. This method requires the recombination of two different libraries of plasmids, each library having a different selectable marker. The method is limited to a finite number of recombinations equal to the number of selectable markers existing, and produces a concomitant linear increase in the number of marker genes linked to the selected sequence(s).

In vivo recombination between two homologous, but truncated, insect-toxin genes on a plasmid has been reported as a method of producing a hybrid gene. The in vivo recombination of substantially mismatched DNA sequences in a host cell having defective mismatch repair enzymes, resulting in hybrid molecule formation, has been reported.

C—SUMMARY OF THE INVENTION

This invention relates generally to the field of cellular and whole organism engineering. Specifically, this invention relates to a cellular transformation, directed evolution, and screening method for creating novel transgenic organisms having desirable properties. Thus in one aspect, this invention relates to a method of generating a transgenic organism, such as a microbe or a plant, having a plurality of traits that are differentially activatable.

In one embodiment, this invention is directed to a method of producing an improved organism having a desirable trait by: a) obtaining an initial population of organisms, b) generating a set of mutagenized organisms, such that when all the genetic mutations in the set of mutagenized organisms are taken as a whole, there is represented a set of substantial genetic mutations, and c) detecting the presence of said improved organism. This invention provides that any of steps a), b), and c) can be further repeated in any particular order and any number of times; accordingly, this invention specifically provides methods comprised of any iterative combination of steps a), b), and c), with any number of iterations.

In another embodiment, this invention is directed to a method of producing an improved organism having a desirable trait by: a) obtaining an initial population of organisms, which can be a clonal population or otherwise, b) generating a set of mutagenized organisms each having at least one genetic mutation, such that when all the genetic mutations in the set of mutagenized organisms are taken as a whole, there is represented a set of substantial genetic mutations, c) detecting the manifestation of at least two genetic mutations, and d) introducing at least two detected genetic mutations into one organism. Additionally, this invention provides that any of steps a), b), c), and d) can be further repeated in any particular order and any number of times; accordingly, this invention specifically provides methods comprised of any iterative combination of steps a), b), c), and d), where the total number of iterations can be from one up to one million, including specifically every integer value in between.
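The iterative structure of steps a) through d) can be illustrated with a toy sketch. Everything in it is a hypothetical simplification and not the method of this invention: organisms are modeled as sets of integer mutation identifiers, the trait is scored with an invented additive model (`effects`), and "detection" simply keeps the two best-scoring mutations in each round.

```python
# Toy model of the mutagenize / detect / combine loop (hypothetical names throughout).

def mutagenize(parent: frozenset, n_candidates: int) -> list:
    """Step b): one mutant per candidate mutation not already in the parent."""
    return [parent | {m} for m in range(n_candidates) if m not in parent]


def score(organism: frozenset, effects: dict) -> float:
    """Hypothetical additive trait model used only for this sketch."""
    return sum(effects.get(m, 0.0) for m in organism)


def detect(mutants: list, parent: frozenset, effects: dict, top_n: int = 2) -> list:
    """Step c): return up to top_n mutations whose mutants outscore the parent."""
    base = score(parent, effects)
    improved = []
    for organism in mutants:
        gain = score(organism, effects) - base
        if gain > 0:
            (new_mutation,) = organism - parent  # each mutant adds exactly one
            improved.append((gain, new_mutation))
    improved.sort(reverse=True)
    return [mutation for _, mutation in improved[:top_n]]


def combine(parent: frozenset, detected: list) -> frozenset:
    """Step d): introduce the detected mutations into one organism."""
    return parent | frozenset(detected)


def evolve(effects: dict, n_candidates: int, rounds: int) -> frozenset:
    organism = frozenset()  # step a): a clonal starting population of one
    for _ in range(rounds):
        detected = detect(mutagenize(organism, n_candidates), organism, effects)
        if len(detected) < 2:  # the embodiment combines at least two mutations
            break
        organism = combine(organism, detected)
    return organism


effects = {0: 0.5, 1: -1.0, 2: 2.0, 3: 0.1, 4: 1.5}  # invented per-mutation effects
result = evolve(effects, n_candidates=5, rounds=2)
print(sorted(result))  # [0, 2, 3, 4]
```

In this toy run the deleterious mutation 1 is never combined, while the four beneficial mutations accumulate over two rounds. A real screen would replace `score` with an assay and would not assume that effects are additive.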

In a preferred aspect of embodiments specified herein the step of b) generating a second set of mutagenized organisms is comprised of generating a plurality of organisms, each of which organisms has a particular transgenic mutation.

As used herein, “generating a set of mutagenized organisms having genetic mutations” can be achieved by any means known in the art to mutagenize, including any radiation known to mutagenize, such as ionizing and ultraviolet radiation. Further examples of serviceable mutagenizing methods include site-saturation mutagenesis, transposon-based methods, and homologous recombination.

“Combining” means incorporating a plurality of different genetic mutations in the genetic makeup (e.g. the genome) of the same organism; methods to achieve this “combining” step include sexual recombination, homologous recombination, and transposon-based methods.

As used herein, an “initial population of organisms” means a “working population of organisms”, which refers simply to a population of organisms with which one is working, and which is comprised of at least one organism. An “initial population of organisms” can be a clonal population or otherwise.

Accordingly, in step 1) an “initial population of organisms” may be a population of multicellular organisms or of unicellular organisms or of both. An “initial population of organisms” may be comprised of unicellular organisms or multicellular organisms or both. An “initial population of organisms” may be comprised of prokaryotic organisms or eukaryotic organisms or both. This invention provides that an “initial population of organisms” is comprised of at least one organism, and preferred embodiments include at least that.

By “organism” is meant any biological form or thing that is capable of self-replication or replication in a host. Examples of “organisms” include the following kinds of organisms (which kinds are not necessarily mutually exclusive): animals, plants, insects, cyanobacteria, microorganisms, fungi, bacteria, eukaryotes, prokaryotes, mycoplasma, viral organisms (including DNA viruses, RNA viruses), and prions.

Non-limiting particularly preferred examples of kinds of “organisms” also include Archaea (archaebacteria) and Bacteria (eubacteria). Non-limiting examples of Archaea (archaebacteria) include Crenarchaeota, Euryarchaeota, and Korarchaeota. Non-limiting examples of Bacteria (eubacteria) include Aquificales, CFB/Green sulfur bacteria group, Chlamydiales/Verrucomicrobia group, Chrysiogenes group, Coprothermobacter group, Cyanobacteria & chloroplasts, Cytophaga/Flexibacter/Bacteroides group, Dictyoglomus group, Fibrobacter/Acidobacteria group, Firmicutes, Flexistipes group, Fusobacteria, Green non-sulfur bacteria, Nitrospira group, Planctomycetales, Proteobacteria, Spirochaetales, Synergistes group, Thermodesulfobacterium group, Thermotogales, and Thermus/Deinococcus group. As non-limiting examples, particularly preferred kinds of organisms include Aquifex, Aspergillus, Bacillus, Clostridium, E. coli, Lactobacillus, Mycobacterium, Pseudomonas, Streptomyces, and Thermotoga. As additional non-limiting examples, particularly preferred organisms include cultivated organisms such as CHO, VERO, BHK, HeLa, COS, MDCK, Jurkat, HEK-293, and WI38. Particularly preferred non-limiting examples of organisms further include host organisms that are serviceable for the expression of recombinant molecules. Organisms further include primary cultures (e.g. cells from harvested mammalian tissues), immortalized cells, all cultivated and culturable cells and multicellular organisms, and all uncultivated and unculturable cells and multicellular organisms.

In a preferred embodiment, knowledge of genomic information is useful for performing the claimed methods; thus, this invention provides the following as preferred but non-limiting examples of organisms that are particularly serviceable for this invention, because there is a significant amount of genomic sequence information, if not complete information (in terms of primary sequence &/or annotation), for these organisms: Human, Insect (e.g. Drosophila melanogaster), Higher plants (e.g. Arabidopsis thaliana), Protozoan (e.g. Plasmodium falciparum), Nematode (e.g. Caenorhabditis elegans), Fungi (e.g. Saccharomyces cerevisiae), Proteobacteria gamma subdivision (e.g. Escherichia coli K-12, Haemophilus influenzae Rd, Xylella fastidiosa 9a5c, Vibrio cholerae El Tor N16961, Pseudomonas aeruginosa PAO1, Buchnera sp. APS), Proteobacteria beta subdivision (e.g. Neisseria meningitidis MC58 (serogroup B), Neisseria meningitidis Z2491 (serogroup A)), Proteobacteria other subdivisions (e.g. Helicobacter pylori 26695, Helicobacter pylori J99, Campylobacter jejuni NCTC 11168, Rickettsia prowazekii), Gram-positive bacteria (e.g. Bacillus subtilis, Mycoplasma genitalium, Mycoplasma pneumoniae, Ureaplasma urealyticum, Mycobacterium tuberculosis H37Rv), Chlamydia (e.g. Chlamydia trachomatis serovar D, Chlamydia muridarum (Chlamydia trachomatis MoPn), Chlamydia pneumoniae CWL029, Chlamydia pneumoniae AR39, Chlamydia pneumoniae J138), Spirochete (e.g. Borrelia burgdorferi B31, Treponema pallidum), Cyanobacteria (e.g. Synechocystis sp. PCC6803), Radioresistant bacteria (e.g. Deinococcus radiodurans R1), Hyperthermophilic bacteria (e.g. Aquifex aeolicus VF5, Thermotoga maritima MSB8), and Archaea (e.g. Methanococcus jannaschii, Methanobacterium thermoautotrophicum deltaH, Archaeoglobus fulgidus, Pyrococcus horikoshii OT3, Pyrococcus abyssi, Aeropyrum pernix K1).

Non-limiting particularly preferred examples of kinds of plant “organisms” include those listed in Table 1.

TABLE 1
Non-limiting examples of plant organisms and sources of transgenic molecules (e.g. nucleic acids & nucleic acid products)

1. Alfalfa
2. Amelanchier laevis
3. Apple
4. Arab. thaliana
5. Arabidopsis
6. Aspergillus flavus
7. Barley
8. Beet
9. Belladonna
10. Brassica oleracea
11. Carrot
12. Chrysanthemum
13. Cichorium intybus
14. Clavibacter
15. Clavibacter xyli
16. Coffee
17. Corn
18. Cotton
19. Cranberry
20. Creeping bentgrass
21. Cryphonectria parasitica
22. Eggplant
23. Festuca arundinacea
24. Fusarium graminearum
25. Fusarium moniliforme
26. Fusarium sporotrichioides
27. Gladiolus
28. Grape
29. Heterorhabditis bacteriophora
30. Kentucky bluegrass
31. Lettuce
32. Melon
33. Oat
34. Onion
35. Papaya
36. Pea
37. Peanut
38. Pelargonium
39. Pepper
40. Persimmon
41. Petunia
42. Pine
43. Pineapple
44. Pink bollworm
45. Plum
46. Poplar
47. Potato
48. Pseudomonas
49. Pseudomonas putida
50. Pseudomonas syringae
51. Rapeseed
52. Rhizobium
53. Rhizobium etli
54. Rhizobium fredii
55. Rhizobium leguminosarum
56. Rhizobium meliloti
57. Rice
58. Rubus idaeus
59. Spruce
60. Soybean
61. Squash
62. Squash-cucumber
63. Squash-cucurbita texana
64. Strawberry
65. Sugarcane
66. Sunflower
67. Sweet potato
68. Sweetgum
69. TMV
70. Tobacco
71. Tomato
72. Walnut
73. Watermelon
74. Wheat
75. Xanthomonas
76. Xanthomonas campestris

As used herein, the meaning of “generating a set of mutagenized organisms having genetic mutations” includes the steps of substituting, deleting, as well as introducing a nucleotide sequence into an organism; and this invention provides that a nucleotide sequence that is serviceable for this purpose may be single-stranded or double-stranded, and that its length may be from one nucleotide up to 10,000,000,000 nucleotides, including specifically every integer value in between.

A mutation in an organism includes any alteration in the structure of one or more molecules that encode the organism. These molecules include nucleic acid, DNA, RNA, prionic molecules, and may be exemplified by a variety of molecules in an organism such as a DNA that is genomic, episomal, or nucleic, or by a nucleic acid that is vectoral (e.g. viral, cosmid, phage, phagemid).

In one aspect, as used herein, a “set of substantial genetic mutations” is preferably a disruption (e.g. a functional knock-out) of at least about 15 to about 150,000 genomic locations or nucleotide sequences (e.g. genes, promoters, regulatory sequences, codons etc.), including specifically every integer value in between. In another aspect, as used herein, a “set of substantial genetic mutations” is preferably an alteration in an expression level (e.g. decreased or increased expression level) or an alteration in the expression pattern (e.g. throughout a period of time) of at least about 15 to about 150,000 genes, including specifically every integer value in between. Corresponding to another aspect, as used herein, a “set of substantial genetic mutations” is preferably an alteration in an expression level (e.g. decreased or increased expression level) or an alteration in the expression pattern (e.g. throughout a period of time) of at least about 15 to about 150,000 gene products &/or phenotypes &/or traits, including specifically every integer value in between.

In another aspect, as used herein, a “set of substantial genetic mutations” with respect to an organism (or type of organism) is preferably a disruption (e.g. a functional knock-out) of at least about 1% to about 100% of genomic locations or nucleotide sequences (e.g. genes, promoters, regulatory sequences, codons etc.) in the organism (or type of organism), including specifically percentages of every integer value in between. In another aspect, as used herein, a “set of substantial genetic mutations” is preferably an alteration in an expression level (e.g. decreased or increased expression level) or an alteration in the expression pattern (e.g. throughout a period of time) of at least about 1% to about 100% of genes in an organism (or type of organism), including specifically percentages of every integer value in between. Corresponding to another aspect, as used herein, a “set of substantial genetic mutations” is preferably an alteration in an expression level (e.g. decreased or increased expression level) or an alteration in the expression pattern (e.g. throughout a period of time) of at least about 1% to about 100% of the gene products &/or phenotypes &/or traits of an organism (or type of organism), including specifically every integer value in between.

In yet another aspect, as used herein, a “set of substantial genetic mutations” is preferably an introduction or deletion of at least about 15 to 150,000 genes, promoters, or other nucleotide sequences (where each sequence is from 1 base to 10,000,000 bases), including specifically every integer value in between. For example, one can introduce a library of at least about 15 to 150,000 nucleotides (genes or promoters) produced by “site-saturation mutagenesis” &/or by “ligation reassembly” (including any specific aspect thereof provided herein) into an “initial population of organisms”.
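The count-based and percentage-based ranges recited above can be expressed as two simple checks. This is only an illustrative sketch: the function names are hypothetical, the word "about" is treated as exact bounds for simplicity, and the 4,300-gene genome used in the example is an invented round number.

```python
# Illustrative checks for the numeric ranges recited above (hypothetical helpers).
MIN_COUNT, MAX_COUNT = 15, 150_000      # "about 15 to about 150,000"
MIN_FRACTION, MAX_FRACTION = 0.01, 1.0  # "about 1% to about 100%"


def is_substantial_by_count(n_affected: int) -> bool:
    """True if the number of affected loci, genes, or traits is in the recited range."""
    return MIN_COUNT <= n_affected <= MAX_COUNT


def is_substantial_by_fraction(n_affected: int, n_total: int) -> bool:
    """True if the affected share of the organism's loci is in the recited range."""
    if n_total <= 0:
        raise ValueError("n_total must be positive")
    return MIN_FRACTION <= n_affected / n_total <= MAX_FRACTION


# Example: a hypothetical genome with 4,300 genes.
print(is_substantial_by_count(14))           # False
print(is_substantial_by_count(15))           # True
print(is_substantial_by_fraction(10, 4300))  # False (about 0.23%)
print(is_substantial_by_fraction(50, 4300))  # True  (about 1.16%)
```

The two criteria are independent aspects in the text, so a given set of mutations may satisfy one and not the other; for example, 15 disrupted genes out of 4,300 meets the count range but falls below 1%.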

It is provided that wherever the manipulation of a plurality of “genes” is mentioned herein, gene pathways (e.g. that ultimately lead to the production of small molecules) are also included. It is appreciated herein that knocking-out, altering expression level, and altering expression pattern can be achieved, by non-limiting exemplification, by mutagenizing a nucleotide sequence corresponding to a gene as well as a corresponding promoter that affects the expression of the gene.

As used herein, a “mutagenized organism” includes any organism that has been altered by a genetic mutation.

A “genetic mutation” can be, by way of non-limiting and non-mutually exclusive exemplification, any change in the nucleotide sequence (DNA or RNA) with respect to genomic, extra-genomic, episomal, mitochondrial, and any nucleotide sequence associated with (e.g. contained within or considered part of) an organism.

According to this invention, detecting the manifestation of a “genetic mutation” means “detecting the manifestation of a detectable parameter”, including but not limited to a change in the genomic sequence. Accordingly, this invention provides that a step of sequencing (&/or annotating) of an organism's genomic DNA is necessary for some methods of this invention, and exemplary but non-limiting aspects of this sequencing (&/or annotating) step are provided herein.

A detectable “trait”, as used herein, is any detectable parameter associated with the organism. Accordingly, such a detectable “parameter” includes, by way of non-limiting exemplification, any detectable “nucleotide knock-in”, any detectable “nucleotide knock-outs”, any detectable “phenotype”, and any detectable “genotype”. By way of further illustration, a “trait” includes any substance produced or not produced by the organism. Accordingly, a “trait” includes viability or non-viability, behavior, growth rate, size, morphology. “Trait” includes increased (or alternatively decreased) expression of a gene product or gene pathway product. “Trait” also includes small molecule production (including vitamins, antibiotics), herbicide resistance, drought resistance, pest resistance, production of any recombinant biomolecule (e.g. vaccines, enzymes, protein therapeutics, chiral enzymes). Additional examples of serviceable traits for this invention are shown in Table 2.

TABLE 2
Non-limiting examples of serviceable genes, gene products, phenotypes, or traits according to the methods of this invention (e.g. knockouts, knockins, increased or decreased expression level, increased or decreased expression pattern)

Table 2 - Part 1. Non-limiting examples of genes or gene products
1. 17 kDa protein 2. 3-hydroxy-3-methylglutaryl Coenzyme A reductase 3. 4-Coumarate:CoA ligase knockout 4. 60 kDa protein 5. Ac transposable element 6. ACC deaminase 7. ACC oxidase knockout 8. ACC synthase 9. ACC synthase knockout 10. Acetohydroxyacid synthase variant 11. Acetolactate synthase 12. Acetyl CoA carboxylase 13. ACP acyl-ACP thioesterase 14. ACP thioesterase 15. Acyl CoA reductase 16. Acyl-ACP knockout 17. Acyl-ACP desaturase 18. Acyl-ACP desaturase knockout 19. Acyl-ACP thioesterase 20. ADP glucose pyrophosphorylase 21. ADP glucose pyrophosphorylase knockout 22. Agglutinin 23. Aleurone 1 24. Alpha hordothionin 25. Alpha-amylase 26. Alpha-hemoglobin 27. Aminoglycoside 3′-adenylyltransferase 28. Amylase 29.
Anionic peroxidase 30. Antibody31. Antifungal protein 32. Antithrombin 33. Antitrypsin 34. Antiviralprotein 35. Aspartokinase 36. Attacin E 37. B1 regulatory gene 38.B-1,3-glucanase knockout 39. B-1,4-endoglucanase knockout 40.Bacteropsin 41. Barnase 42. Barstar 43. Beta-hemoglobin 44.B-glucuronidase 45. C1 knockout 46. C1 regulatory gene 47. C2 knockout48. C3 knockout 49. Caffeate O-methylthransferase 50. CaffeateO-methyltransferase knockout 51. Caffeoyl CoA O-methyltransferaseknockout 52. Casein 53. Cecropin 54. Cecropin B 55. Cellulose bindingprotein 56. Chalcone synthase knockout 57. Chitinase 58. Chitobiosidase59. Chloramphenicol acetyltransferase 60. Cholera toxin B 61. Cholineoxidase 62. Cinnamate 4-hydroxylase 63. Cinnamate 4-hydroxylase knockout64. Coat protein 65. Coat protein knockout 66. Conglycinin 67. CryIA 68.CryIAb 69. CryIAc 70. CryIB 71. CryIIA 72. CryIIIA 73. CryVIA 74. Cyclindependent kinase 75. Cyclodexlrin glycosyltransferase 76. Cylindricalinclusion protein 77. Cystathionine synthase 78. Delta-12 desaturase 79.Delta-12 desaturase knockout 80. Delta-12 saturase 81. Delta-12 saturaseknockout 82. Delta-15 desaturase 83. Delta-15 desaturase knockout 84.Delta-9 desaturase 85. Delta-9 desturase knockout 86. Deoxyhypusinesynthase (DHS) 87. Deoxyhypusine synthase knockout 88. Diacylglycerolacetyl tansferase 89. Dihydrodipicolinate synthase 90. Dihydrofolatereductase 91. Diptheria toxin A 92. Disease resistance response gene 4993. Double stranded ribonuclease 94. Ds transposable element 95.Elongase 96. EPSPS 97. Ethylene forming enzyme knockout 98. Ethylenereceptor protein 99. Ethylene receptor protein knockout 100. Fatty acidelongase 101. Fluorescent protein 102. G glycoprotein 103. Galactanase104. Galanthus nivalis agglutinin 105. Genome-linked protein 106.Glucanase 107. Glucanase knockout 108. Glucose oxidase 109. Glutamatedehydrogenase 110. Glutamine binding protein 111. Glutamine synthetase112. Glutenin 113. 
Glycerol-3-phosphate acetyl transferase 114.Glyphosate exidoreductase 115. Glyphosate oxidoreductase 116. Greenfluorescent protein 117. Helper component 118. Hemicellulase 119. Huplocus 120. Hygromycin phosphotransferase 121. Hyoscamine 6B-hydroxylase122. IAA monooxygenase 123. Invertase 124. Invertase knockout 125.Isopentenyl transferase 126. Ketoacyl-ACP synthase 127. Ketoacyl-ACPsynthase knockout 128. Larval serum protein 129. Leafy homeoticregulatory gene 130. Lectin 131. Lignin peroxidase 132. Luciferase 133.Lysine-2 gene 134. Lysophosphatidic acid acetyl transferase 135.Lysozyme 136. Mabinlin 137. Male sterility protein 138. Metallothionein139. Modified ethylene receptor protein 140. Modified ethylene receptorprotein knockout 141. Monooxygenase 142. Movement protein 143. Movementprotein nonfunctional 144. N gene for TMV resistance 145. N-acetylglucosidase 146. Nitrilase 147. Nopaline synthase 148. Notch 149. NptII150. Nuclear inclusion protein a 151. Nuclear inclusion protein b 152.Nucleocapsid 153. Nucleoprotein 154. O-acyl transferase 155. Oleayl-ACPthioesterase 156. Omega 3 desaturase 157. Omega 3 desaturease knockout158. Omega 6 desaturase 159. Omega 6 desaturase knockout 160.O-methyltransferase 161. Osmotin 162. Oxalate oxidase 163. Par locus164. Pathogenesis protein 1a 165. Pectate lyase 166. Pectin esterase167. Pectin esterase knockout 168. Pectin methylesterase 169. Pectinmethylesterase knockout 170. Pentenlypyrophosphate isomerase 171.Phosphinothricin 172. Phosphinothricin acetyl transferase 173.Phytochrome A 174. Phytoene synthase 175. Phleomycin binding protein176. Polygalacturonase 177. Polygalacturonase knockout 178.Polygalacturonase inhibitor protein 179. Prf regulatory gene 180.Prosystemin 181. Protease 182. Protein A 183. Protein kinase 184.Proteinase inhibitor 1 185. Pti5 transcription factor 186. R regulatorygene 187. Receptor kinase 188. Recombinase 189. Reductase 190. Replicase191. Resveratrol synthase 192. Ribonuclease 193. ro1c 194. 
Rol hormonegene 195. S-adenosylmethione decarboxylase 196. S-adenosylmethionehydrolase 197. S-adenosylmethionine transferase 198. Salicylatehydroxylase 199. Satellite RNA 200. Seed storage protein 201.Serine-threonine protein kinase 202. Serum albumin 203. Shrunken 2 204.Sorbitol dehydrogenase 205. Sorbitol synthase 206. Stilbene synthase207. Storage protein 208. Sucrose phosphate synthase 209. Systemicacquired resistance gene 8.2 210. Tetracycline binding protein 211.Thioesterase (×2) 212. Thiolase 213. TobRB7 214. Transcriptionalactivator 215. Transposon Tn5 216. Trehalase 217. Trehalase knockout218. Trichodiene synthase 219. Trichosanthin 220. Trifolitoxin 221.Trypsin inhibitor 222. T-URF13 mitochondrial 223. UDP glucoseglucosyltransferase 224. Violaxanthin de-epoxidase 225. Violaxanthinde-epoxidase knockout 226. Wheat germ agglutinin 227.Xanthosine-N7-methyltransferase knockout 228. Zein storage protein Table2 - Part 2. Non-limiting examples of input traits/phenotypes 1. 2,4-Dtolerant 2. Alernaria resistant 3. Altered amino acid composition 4.Alternaria solani resistant 5. Ammonium assimilation increased 6. AMVresistant 7. Aphid resistant 8. Apple scab resistant 9. Aspergillusresistant 10. B-1,4-endoglucanase 11. Bacterial leaf blight resistant12. Bacterial speck resistant 13. BCTV resistant 14. Blackspot bruiseresistant 15. BLRV resistant 16. BNYVV Resistant 17. Botrytis cinerearesistant 18. Botrytis resistant 19. BPMV resistant 20. Bromoxyniltolerant 21. BYDV resistant 22. BYMV resistant 23. Carbohydratemetabolism altered 24. Cell wall altered 25. Chlorsulfuron tolerant 26.Clavibacter resistant 27. CLRV resistant 28. CMV resistant 29. Coldtolerant 30. Coleopteran resistant 31. Colletotrichum resistant 32.Colorado potato beetle resistant 33. Constitutive expression ofglutamine synthetase 34. Corynebacterium sepedonicum resistant 35.Cottonwood leaf beetle resistant 36. Crown gall resistant 37. Crown rotresistant 38. Cucumovirus resistant 39. 
Cutting rootability increased40. Downy mildew resistant 41. Drought tolerant 42. Erwinia carotovoraresistant 43. Ethylene production reduced 44. European Corn Borerresistant 45. Female sterile 46. Fenthion susceptible 47. Fertilityaltered 48. Fire blight resistant 49. Flower and fruit abscissionreduced 50. Flower and fruit set altered 51. Flowering altered 52.Flowering time altered 53. Frogeye leaf spot resistant 54. Fruitripening altered 55. Fruit ripening delayed 56. Fruit rot resistant 57.Fruit solids increased 58. Fruit sweetness increased 59. Fungalpost-harvest resistant 60. Fungal resistant 61. Fungal resistant general62. Fusarium resistant 63. Glyphosate tolerant 64. Growth rate altered65. Growth rate reduced 66. Heat stable glucanase produced 67.Hordothionin produced 68. Imidazolinone tolerant 69. Insect resistantgeneral 70. Kanamycin resistant 71. Lepidopteran resistant 72. Lessercornstalk borer resistant 73. LMV resistant 74. Loss of systemicresistance 75. Male sterile 76. Marssonina resistant 77. MCDV resistant78. MCMV resistant 79. MDMV resistant 80. MDMV-B resistant 81. Mealybugwilt virus resistant 82. Melamtsora resistant 83. Melodgyne resistant84. Methotrexate resistant 85. Mexican Rice Borer resistant 86.Nucleocapsid protein produced 87. Oblique banded leafroller resistant88. PEMV resistant 89. PeSV resistant 90. Phoma resistant 91.Phosphinothricin tolerant 92. Phratora leaf beetle resistant 93.Phytophthora resistant 94. PLRV resistant 95. Polyamine metabolismaltered 96. Potyvirus resistant 97. Powdery mildew resistant 98. PPVresistant 99. Pratylenchus vulnus resistant 100. Proteinase inhibitorslevel constitutive 101. PRSV resistant 102. PRV resistant 103. PSbMVresistant 104. Pseudomonas syringae resistant 105. PStV resistant 106.PVX resistant 107. PVY resistant 108. RBDV resistant 109. Rhizoctoniaresistant 110. Rhizoctonia solani resistant 111. Ring rot resistance112. Root-knot nematode resistant 113. SbMV resistant 114. Sclerotiniaresistant 115. 
SCMV resistant 116. SCYLV resistant 117. Secondarymetabolite increased 118. Seed set reduced 119. Selectable marker 120.Senescence altered 121. Septoria resistant 122. Shorter stems 123. Softrot fungal resistant 124. Soft rot resistant 125. SqMV resistant 126.SrMV resistant 127. Storage protein altered 128. Streptomyces scabiesresistant 129. Sulfonylurea tolerant 130. Tetracycline binding proteinproduced 131. TEV resistant 132. Thelaviopsis resistant 133. TMVresistant 134. Tobamovirus resistant 135. ToMoV resistant 136. ToMVresistant 137. Transposon activator 138. Transposon inserted 139. TRVresistant 140. TSWV resistant 141. TVMV resistant 142. TYLCV resistant143. Tyrosine level increased 144. Venturia resistant 145. Verticilliumdahliae resistant 146. Verticillium resistant 147. Visual marker 148.WMV2 resistant 149. WSMV resistant 150. Yield increased 151. ZYMVresistant Table 2 - Part 3. Non-limiting examples of outputtraits/phenotypes 1. ACC oxidase level decreased 2. Altered ligninbiosynthesis 3. B-1,4-endoglucanase 4. Botrytis resistant 5.Carbohydrate metabolism altered 6. Carotenoid content altered 7. Cellwall altered 8. CMV resistant 9. Coleopteran resistant 10. Dry mattercontent increased 11. Ethylene production reduced 12. Ethylene synthesisreduced 13. Fatty acid metabolism altered 14. Fire blight resistant 15.Flower and fruit abscission reduced 16. Flower and fruit set altered 17.Flowering time altered 18. Fruit firmness increased 19. Fruit pectinesterase levels decreased 20. Fruit ripening altered 21. Fruit ripeningdelayed 22. Fruit solids increased 23. Fruit sugar profile altered 24.Fruit sweetness increased 25. Glucuronidase expressing 26. Heat stableglucanase produced 27. Heavy metals sequestered 28. Hordothioninproduced 29. Improved fruit quality 30. Industrial enzyme produced 31.Lepidopteran resistant 32. Lysine level increased 33. Mealybug wiltvirus resistant 34. Methionine level increased 35. Nucleocapsid proteinproduced 36. Oil profile altered 37. 
Pectin esterase level reduced 38.Pharmaceutical proteins produced 39. Phosphinothricin tolerant 40.Phytoene synthase activity increased 41. Pigment metabolism altered 42.Polygalacturonase level reduced 43. Processing characteristics altered44. Prolonged shelf life 45. Protein altered 46. Protein quality altered47. PRSV resistant 48. Root-knot nematode resistant 49. Sclerotiniaresistant 50. Seed composition altered 51. Seed methionine storageincreased 52. Seed set reduced 53. Seed storage protein 54. Senescencealtered (e.g. Shelf life increased) 55. Shorter stems 56. Solidsincreased 57. SqMV resistant 58. Starch level increased 59. Starchmetabolism altered 60. Starch reduced 61. Sterols increased 62. Storageprotein altered 63. Sugar alcohol levels increased 64. Telracyclinebinding protein produced 65. Tyrosine level increased 66. Verticilliumresistant 67. Visual marker 68. WMV2 resistant 69. Yield increased 70.ZYMV resistant Table 2 - Part 4. Non-limiting examples oftraits/phenotypes with agronomic properties 1. ACC oxidase leveldecreased 2. Altered amino acid composition 3. Altered ligninbiosynthesis 4. Altered maturing 5. Altered plant development 6.Aluminum tolerant 7. Ammonium assimilation increased 8. Anthocyaninproduced in seed 9. B-1,4-endoglucanase 10. Calmodulin level altered 11.Carbohydrate metabolism altered 12. Carotenoid content altered 13. Cellwall altered 14. Cold tolerant 15. Constitutive expression of glutaminesynthetase 16. Cutting root ability increased 17. Development altered18. Drought tolerant 19. Dry matter content increased 20. Environmentalstress reduced 21. Ethylene metabolism altered 22. Ethylene productionreduced 23. Ethylene synthesis reduced 24. Fatty acid metabolism altered25. Female sterile 26. Fenthion susceptible 27. Fertility altered 28.Fiber quality altered 29. Flower and fruit abscission reduced 30. Flowerand fruit set altered 31. Flowering altered 32. Flower color altered 33.Flowering time altered 34. 
Fruit firmness increased 35. Fruit pectinesterase and levels decreased 36. Fruit polygalacturonase leveldecreased 37. Fruit ripening altered 38. Fruit ripening delayed 39.Fruit solids increased 40. Fruit sugar profile altered 41. Fruitsweetness increased 42. Glucuronidase expressing 43. Growth rate altered44. Growth rate increased 45. Growth rate reduced 46. Heat stableglucanase produced 47. Heat tolerant 48. Heavy metals sequestered 49.Hordothionin produced 50. Improved fruit quality 51. Increasedphosphorus 52. Increased stalk strength 53. Industrial enzyme produced54. Lignin levels decreased 55. Lipase expressed in seeds 56. Lysinelevel increased 57. Male sterile 58. Male sterile reversible 59.Methionine level increased 60. Modified growth characteristics 61.Mycotoxin degradation 62. Nitrogen metabolism altered 63. Nucleocapsidprotein produced 64. Oil profile altered 65. Oil quality altered 66.Oxidative stress tolerant 67. Pectin esterase level reduced 68.Pharmaceutical proteins produced 69. Photosynthesis enhanced 70.Phytoene synthase activity increased 71. Pigment metabolism altered 72.Polyamine metabolism altered 73. Polygalacturonase level reduced 74.Pratylenchus vulnus resistant 75. Processing characteristics altered 76.Prolonged shelf life 77. Protein altered 78. Protein lysine levelincreased 79. Protein quality altered 80. Proteinase inhibitors levelconstitutive 81. Salt tolerance increased 82. Seed composition altered83. Seed methionine storage increased 84. Seed set reduced 85.Selectable marker 86. Senescence altered 87. Shorter stems 88. Solidsincreased 89. Starch level increased 90. Starch metabolism altered 91.Starch reduced 92. Sterols increased 93. Storage protein altered 94.Stress tolerant 95. Sugar alcohol levels increased 96. Tetracyclinebinding protein produced 97. Thermostable protein produced 98.Transposon activator 99. Transposon inserted 100. Tyrosine levelincreased 101. Visual marker 102. Vivipary increased 103. 
Yieldincreased Table 2 - Part 5. Non-limiting examples of traits/phenotypeswith product quality properties 1. 2,4-D tolerant 2. ACC oxidase leveldecreased 3. Altered amino acid composition 4. Altered ligninbiosynthesis 5. Anthocyanin produced in seed 6. Antioxidant enzymeincreased 7. Auxin metabolism and increased tuber solids 8.B-1,4-endoglucanase 9. Blackspot bruise resistant 10. Brown spotresistant 11. Bruising reduced 12. Caffeine levels reduced 13.Carbohydrate metabolism altered 14. Carotenoid content altered 15. Cellwall altered 16. Cold tolerant 17. Delayed softening 18. Disulfidesreduced in endosperm 19. Dry matter content increased 20. Ear moldresistant 21. Ethylene production reduced 22. Ethylene synthesis reduced23. Extended flower life 24. Fatty acid metabolism altered 25. Fiberquality altered 26. Fiber strength altered 27. Flavor enhancer 28.Flower and fruit abscission reduced 29. Fruit firmness increased 30.Fruit invertase level decreased 31. Fruit polygalacturonase leveldecreased 32. Fruit ripening altered 33. Fruit ripening delayed 34.Fruit solids increased 35. Fruit sugar profile altered 36. Fruitsweetness increased 37. Glyphosate tolerant 38. Heat stable glucanaseproduced 39. Improved fruit quality 40. Increased phosphorus 41.Increased protein levels 42. Lignin levels decreased 43. Lysine levelincreased 44. Male sterile 45. Melanin produced in cotton fibers 46.Metabolism altered 47. Methionine level increased 48. Mycotoxindegradation 49. Mycotoxin production inhibited 50. Nicotine levelsreduced 51. Nitrogen metabolism altered 52. Novel protein produced 53.Nutritional quality altered 54. Oil profile altered 55. Oil qualityaltered 56. Pectin esterase level reduced 57. Photosynthesis enhanced58. Phytoene synthase activity increased 59. Pigment metabolism altered60. Polyamine metabolism altered 61. Polygalacturonase level reduced 62.Processing characteristics altered 63. Prolonged shelf life 64. Proteinaltered 65. Protein lysine level increased 66. 
Protein quality altered67. Proteinase inhibitors level constitutive 68. Rust resistant 69. Seedcomposition altered 70. Seed methionine storage increased 71. Seednumber increased 72. Seed quality altered 73. Seed set reduced 74. Seedweight increased 75. Senescence altered 76. Solids increased 77. Starchlevel increased 78. Starch metabolism altered 79. Starch reduced 80.Steroidal glycoalkaloids reduced 81. Sterols increased 82. Storageprotein altered 83. Sugar alcohol levels increased 84. Thermostableprotein produced 85. Tryptophan level increased 86. Tuber solidsincreased 87. Yield increased Table 2 - Part 6. Non-limiting examples oftraits/phenotypes with herbicide tolerance properties 1. 2,4-D tolerant2. Chloroacetanilide tolerant 3. Fertility altered 4. Protein altered 5.Lignin levels decreased 6. Methionine level increased 7. Bromoxyniltolerant 8. Metabolism altered 9. Imidazole tolerant 10. Imidazolinonetolerant 11. Sulfonylurea tolerant 12. Northern corn leaf blightresistant 13. Herbicide tolerant 14. Isoxazole tolerant 15.Chlorsulfuron tolerant 16. Glyphosate tolerant 17. Lepidopteranresistant 18. Phosphinothricin tolerant 19. Sulfonylurea tolerant Table2 - Part 7. Non-limiting examples of traits/phenotypes with pestresistance properties 1. Agrobacterium resistant - BR 2. Alternariaresistant - FR 3. Alternaria daucii resistant - FR 4. Alternaria solaniresistant - FR 5. AMV resistant - VR 6. Anthracnose resistant - FR 7.Aphid resistant - IR 8. Apple scab resistant - FR 9. Aspergillusresistant - FR 10. Bacterial leaf blight resistant - BR 11. Bacterialresistant - BR 12. Bacterial soft rot resistant - BR 13. Bacterial softrot resistant - VR 14. Bacterial speck resistant - BR 15. BCTVresistant - VR 16. Black shank resistant - FR 17. BLRV resistant - VR18. BNYVV resistant - VR 19. Botrytis cinerea resistant - FR 20.Botrytis resistant - FR 21. BPMV resistant - VR 22. Brown spotresistant - FR 23. BYDV resistant - VR 24. BYMV resistant - VR 25. CaMVresistant - VR 26. 
Cercospora resistant - FR 27. Clavibacter resistant -BR 28. Closteroviurs resistant - BR 29. CLRV resistant - VR 30. CMVresistant - FR 31. Coleopteran resistant - IR 32. Colletotrichumresistant - FR 33. Colorado potato beetle resistant - IR 34. Cornearworm resistant - IR 35. Corynebacterium sepedonicum resistant - BR36. Cottonwood leaf beetle resistant - IR 37. Criconnemella resistant -NR 38. Crown gall resistant - BR 39. Cucumovirus resistant - VR 40.Cylindrosporium resistant - FR 41. Disease resistant general - FR 42.Dollar spot resistant - FR 43. Downy mildew resistant - FR 44. Ear moldresistant - FR 45. Erwinia carotovora resistant - BR 46. European CornBorer resistant - IR 47. Eyespot resistant - FR 48. Fall armywormresistant - IR 49. Fire blight resistant - BR 50. Frogeye leaf spotresistanT - FR 51. Fruit rot resistant - FR 52. Fungal post-harvestresistant - FR 53. Fungal resistant - FR 54. Fungal resistant general -FR 55. Fusarium dehlae resistant - FR 56. Fusarium resistant - FR 57.Geminivirus resistant - VR 58. Gray lead spot resistant - FR 59.Helminthosporium resistant - FR 60. Hordothionin produced - BR 61.Insect predator resistant - IR 62. Insect resistant general - IR 63.Late blight resistant - FR 64. Leaf blight resistant - FR 65. Leaf spotresistant - FR 66. Lepidopteran resistant - IR 67. Lesser cornstalkborer resistant - IR 68. LMV resistant - VR 69. Loss of systemicresistance - VR 70. Marssonina resistant - FR 71. MCDV resistant - VR72. MCMV resistant - VR 73. MDMV resistant - VR 74. MDMV-B resistant -VR 75. Mealybug wilt virus resistant - VR 76. Melamtsora resistant - FR77. Melodgyne resistant - NR 78. Meloidogyne resistant - NR 79. MexicanRice Borer resistant - IR 80. Mycotoxin degradation - FR 81. Nepovirusresistant - VR 82. Northern corn leaf blight resistant - IR 83.Nucleocapsid protein produced - VR 84. Oblique banded leafrollerresistant - IR 85. Oomycete resistant - FR 86. Pathogenesis relatedproteins level increased - FR 87. 
PEMV resistant - VR 88. PeSVResistant - VR 89. Phatora leaf beetle resistant - IR 90. Phomaresistant - FR 91. Phytophthora resistant - FR 92. PLRV resistant - VR93. Potyvirus resistant - VR 94. Powdery mildew resistant - FR 95. PPVresistant - VR 96. Pralylenchus vulnus resistant - NR 97. PRSVresistant - VR 98. PRV resistant - VR 99. PSbMV resistant - VR 100.Pseudomonas syringae resistant - BR 101. PStV resistant - VR 102. PVXresistant - VR 103. PVY resistant - VR 104. RBDV resistant - VR 105.Rhizoctonia resistant - FR 106. Rhizoctonia solani resistant - FR 107.Ring rot resistance - BR 108. Root-knot nematode resistant - NR 109.Rust resistant - FR 110. SbMV resistant - VR 111. Sclerotiniaresistant - FR 112. SCMV resistant - VR 113. SCYLV resistant - VR 114.Septoria resistant - FR 115. Smut resistant - FR 116. SMV resistant - VR117. Sod web worm resistant - IR 118. Soft rot fungal resistant - FR119. Soft rot resistant - BR 120. Southwestern corn borer resistant- IR121. SPFMV resistant - VR 122. Sphaeropsis fruit rot resistant - FR 123.SqMV resistant - VR 124. SrMV resistant - VR 125. Streptomyces scabiesresistant - BR 126. Sugar cane borer resistant - IR 127. TEV resistant -VR 128. Thelaviopsis resistant - FR 129. TMV resistant - FR 130.Tobamovirus resistant - VR 131. ToMoV resistant - VR 132. ToMVresistant - VR 133. TRV resistant - VR 134. TSWV resistant - VR 135.TVMV resistant - VR 136. TYLCV resistant - VR 137. Venturia resistant -FR 138. Verticillium dahliae resistant - FR 139. Verticilliumresistant - FR 140. Western corn root worm resistant - IR 141. WMV2resistant - VR 142. WSMV resistant - VR 143. ZYMV resistant - VR Table2 - Part 8. Non-limiting examples of miscellaneous traits/ phenotypeswith properties 1. Antibiotic produced 2. Antiprotease producing 3.Capable of growth on defined synthetic media 4. Carbohydrate metabolismaltered 5. Cell wall altered 6. Cold tolerant 7. Coleopteran resistant8. Color altered 9. Color sectors in seeds 10. 
Colored sectors in leaves 11. Constitutive expression of glutamine synthetase 12. Cre recombinase produced 13. Dalapon tolerant 14. Development altered 15. Disease resistant general 16. Ethylene metabolism altered 17. Expression optimization 18. Fenthion susceptible 19. Glucuronidase expressing 20. Glyphosate tolerant 21. Growth rate reduced 22. Heavy metals sequestered 23. Hygromycin tolerant 24. Inducible DNA modification 25. Industrial enzyme produced 26. Kanamycin resistant 27. Lipase expressed in seeds 28. Methotrexate resistant 29. Modified growth characteristics 30. Mycotoxin deficient 31. Mycotoxin production inhibited 32. Mycotoxin restored 33. Non-lesion forming mutant 34. Novel protein produced 35. Oil quality altered 36. Peroxidase levels increased 37. Pharmaceutical proteins produced 38. Phosphinothricin tolerant 39. Pigment metabolism altered 40. Pollen visual marker 41. Polyamine metabolism altered 42. Polymer produced 43. Recombinase produced 44. Secondary metabolite increased 45. Seed color altered 46. Seed weight increased 47. Selectable marker 48. Spectinomycin resistant 49. Sterile 50. Sterols increased 51. Sulfonylurea susceptible 52. Syringomycin deficient 53. Transposon activator 54. Transposon elements inserted 55. Transposon inserted 56. Trifolitoxin producing 57. Trifolitoxin resistant 58. Virulence reduced 59. Visual marker 60. Visual marker inactive

Legend: BR - Bacterial Resistant; FR - Fungal Resistant; IR - Insect Resistant; NR - Nematode Resistant; VR - Viral Resistant

In a particular exemplification, “producing an organism having a desirable trait” includes producing an organism that is altered with respect to an organ or a part of an organ but not necessarily altered anywhere else.

By “trait” is meant any detectable parameter associated with an organism under a set of conditions. Examples of “detectable parameters” include the ability to produce a substance, the ability to not produce a substance, an altered pattern of (such as an increased or a decreased) ability to produce a substance, viability, non-viability, behaviour, growth rate, size, morphology or morphological characteristic,

In another embodiment, this invention is directed to a method of producing an organism having a desirable trait or a desirable improvement in a trait by: a) obtaining an initial population of organisms comprised of at least one starting organism, b) mutagenizing the population such that mutations occur throughout a substantial part of the genome of at least one initial organism, c) selecting at least one mutagenized organism having a desirable trait or a desirable improvement in a trait, and d) optionally repeating the method by subjecting one or more mutagenized organisms to a repetition of the method. A mutagenized organism having a desirable trait or a desirable improvement in a trait can be referred to as an “up-mutant”, and the associated mutation(s) contained in an up-mutant organism can be referred to as up-mutation(s).

In one embodiment, step c) is comprised of selecting at least two different mutagenized organisms, each having a different mutagenized genome, and the method of producing an organism having a desirable trait or a desirable improvement in a trait is comprised of a) obtaining a starting population of organisms comprised of at least one starting organism, b) mutagenizing the population such that mutations occur throughout a substantial part of the genome of at least one starting organism, c) selecting at least two mutagenized organisms having a desirable trait or a desirable improvement in a trait, d) creating combinations of the mutations of the two or more mutagenized organisms, e) selecting at least one mutagenized organism having a desirable trait or a desirable improvement in a trait, and f) optionally repeating the method by subjecting one or more mutagenized organisms to a repetition of the method.

In one embodiment, the method is repeated. Thus, for example, an up-mutant organism can serve as a starting organism for the above method. Also, for example, an up-mutant organism having a combination of two or more up-mutations in its genome can serve as a starting organism for the above method.
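The iterative scheme described above (mutagenize, select up-mutants, combine their mutations, select again, repeat) can be illustrated with a toy simulation. This is only a sketch under strong simplifying assumptions: a genome is modeled as a list of 0/1 "genes", and the function names `mutagenize`, `combine`, and `evolve`, the fitness function, and all parameter values are hypothetical illustrations, not part of the disclosed method.

```python
import random

def mutagenize(genome, rate, rng):
    """Flip each gene (0/1) independently with probability `rate` (toy mutagenesis)."""
    return [g ^ 1 if rng.random() < rate else g for g in genome]

def combine(parent, up_mutants):
    """Stack the mutations of several up-mutants onto the parent genome."""
    return [p ^ 1 if any(m[i] != p for m in up_mutants) else p
            for i, p in enumerate(parent)]

def evolve(start, fitness, rounds=5, pool=50, rate=0.2, seed=0):
    """Repeat: mutagenize, select up-mutants, combine their mutations, reselect."""
    rng = random.Random(seed)
    best = list(start)
    for _ in range(rounds):
        # b) mutagenize across a substantial part of the genome
        mutants = [mutagenize(best, rate, rng) for _ in range(pool)]
        # c) select up to two up-mutants (strictly fitter than the parent)
        ups = sorted((m for m in mutants if fitness(m) > fitness(best)),
                     key=fitness, reverse=True)[:2]
        if not ups:
            continue  # no up-mutant this round; keep the current parent
        # d) combine the mutations of the selected up-mutants
        candidates = ups + ([combine(best, ups)] if len(ups) == 2 else [])
        # e) select the best candidate; f) it seeds the next repetition
        best = max(candidates, key=fitness)
    return best

if __name__ == "__main__":
    start = [0] * 20
    result = evolve(start, fitness=sum)  # hypothetical trait: number of 1-genes
    print(sum(start), "->", sum(result))
```

Because only strictly fitter candidates replace the parent, the measured trait never decreases from one repetition to the next in this toy model.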

Thus, in one embodiment, this invention is directed to a method of producing an organism having a desirable trait or a desirable improvement in a trait by: a) obtaining a starting population of organisms comprised of at least one starting organism, b) mutagenizing the population such that mutations occur throughout a substantial part of the genome of at least one starting organism, c) selecting at least one mutagenized organism having a desirable trait or a desirable improvement in a trait, and d) optionally repeating the method by subjecting one or more mutagenized organisms to a repetition of the method. A mutagenized organism having a desirable trait or a desirable improvement in a trait can be referred to as an “up-mutant”, and the associated mutation(s) contained in an up-mutant organism can be referred to as up-mutation(s).

Mutagenizing a starting population such that mutations occur throughout a substantial part of the genome of at least one starting organism refers to mutagenizing at least approximately 1% of the genes of a genome, or at least approximately 10% of the genes of a genome, or at least approximately 20% of the genes of a genome, or at least approximately 30% of the genes of a genome, or at least approximately 40% of the genes of a genome, or at least approximately 50% of the genes of a genome, or at least approximately 60% of the genes of a genome, or at least approximately 70% of the genes of a genome, or at least approximately 80% of the genes of a genome, or at least approximately 90% of the genes of a genome, or at least approximately 95% of the genes of a genome, or at least approximately 98% of the genes of a genome.

In a particular embodiment, this invention provides a method of producing an organism having a desirable trait or a desirable improvement in a trait by: a) obtaining sequence information of a genome; b) annotating the genomic sequence obtained; c) mutagenizing a substantial part of the genome; d) selecting at least one mutagenized genome having a desirable trait or a desirable improvement in a trait; and e) optionally repeating the method by subjecting one or more mutagenized genomes to a repetition of the method.

Thus in one aspect, this invention provides a process comprised of:

1) Subjecting a working cell or organism to holistic monitoring (which can include the detection and/or measurement of all detectable functions and physical parameters). Examples of such parameters include morphology, behavior, growth, responsiveness to stimuli (e.g., antibiotics, different environment, etc.). Additional examples include all measurable molecules, including molecules that are chemically at least in part nucleic acids, proteins, carbohydrates, proteoglycans, glycoproteins, or lipids. In a particular aspect, performing holistic monitoring is comprised of using a microarray-based method. In another aspect, performing holistic monitoring is comprised of sequencing a substantial portion of the genome, i.e. for example at least approximately 10% of the genome, or for example at least approximately 20% of the genome, or for example at least approximately 30% of the genome, or for example at least approximately 40% of the genome, or for example at least approximately 50% of the genome, or for example at least approximately 60% of the genome, or for example at least approximately 70% of the genome, or for example at least approximately 80% of the genome, or for example at least approximately 90% of the genome, or for example at least approximately 95% of the genome, or for example at least approximately 98% of the genome.

2) Introducing into the working cell or organism a plurality of traits (stacked traits), including selectively and differentially activatable traits. Serviceable traits for this purpose include traits conferred by genes and traits conferred by gene pathways.

3) Subjecting the working cell or organism to holistic monitoring.

4) Compiling the information obtained from steps 1) and 3), and processing &/or analyzing it to better understand the changes introduced into the working cell or organism. Such data processing includes identifying correlations between and/or among the measured parameters.

5) Repeating any number or all of steps 2), 3), and 4).

This invention provides that molecules serviceable for introducing transgenic traits into a plant include all known genes and nucleic acids. By way of non-limiting exemplification, this invention specifically names any number &/or combination of genes listed herein or listed in any reference incorporated herein by reference. Furthermore, by way of non-limiting exemplification, this invention specifically names any number &/or combination of genes & gene pathways listed herein as well as in any reference incorporated by reference herein. This invention provides that molecules serviceable as detectable parameters include any molecule, any enzyme, substrate thereof, product thereof, and any gene or gene pathway listed herein, including in any figure or table herein as well as in any reference incorporated by reference herein.

This invention also relates generally to the field of nucleic acid engineering and correspondingly encoded recombinant protein engineering. More particularly, the invention relates to the directed evolution of nucleic acids and screening of clones containing the evolved nucleic acids for resultant activity(ies) of interest, such as nucleic acid activity(ies) &/or specified protein, particularly enzyme, activity(ies) of interest.

Mutagenized molecules provided by this invention may include chimeric molecules and molecules with point mutations, including biological molecules that contain a carbohydrate, a lipid, a nucleic acid, &/or a protein component, and specific but non-limiting examples of these include antibiotics, antibodies, enzymes, and steroidal and non-steroidal hormones.

This invention relates generally to a method of: 1) preparing a progeny generation of molecule(s) (including a molecule that is comprised of a polynucleotide sequence, a molecule that is comprised of a polypeptide sequence, and a molecule that is comprised in part of a polynucleotide sequence and in part of a polypeptide sequence), that is mutagenized to achieve at least one point mutation, addition, deletion, &/or chimerization, from one or more ancestral or parental generation template(s); 2) screening the progeny generation molecule(s), preferably using a high throughput method, for at least one property of interest (such as an improvement in an enzyme activity or an increase in stability or a novel chemotherapeutic effect); 3) optionally obtaining &/or cataloguing structural &/or functional information regarding the parental &/or progeny generation molecules; and 4) optionally repeating any of steps 1) to 3).

In a preferred embodiment, there is generated (e.g. from a parent polynucleotide template), in what is termed “codon site-saturation mutagenesis”, a progeny generation of polynucleotides, each having at least one set of up to three contiguous point mutations (i.e. different bases comprising a new codon), such that every codon (or every family of degenerate codons encoding the same amino acid) is represented at each codon position. Corresponding to, and encoded by, this progeny generation of polynucleotides, there is also generated a set of progeny polypeptides, each having at least one single amino acid point mutation. In a preferred aspect, there is generated, in what is termed “amino acid site-saturation mutagenesis”, one such mutant polypeptide for each of the 19 naturally encoded polypeptide-forming alpha-amino acid substitutions at each and every amino acid position along the polypeptide. This yields, for each and every amino acid position along the parental polypeptide, a total of 20 distinct progeny polypeptides including the original amino acid, or potentially more than 21 distinct progeny polypeptides if additional amino acids are used either instead of or in addition to the 20 naturally encoded amino acids.
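The counting in the preceding paragraph can be checked with a short in-silico enumeration. The sketch below assumes the standard genetic code and an in-frame parental coding sequence whose length is a multiple of three; the function names (`translate`, `codon_site_saturation`, `amino_acid_variants`) and the example sequence are hypothetical and serve only as an illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order ("*" marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}

def translate(dna):
    """Translate an in-frame DNA string into a one-letter polypeptide string."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))

def codon_site_saturation(dna):
    """Yield (codon_index, variant_dna) for all 63 alternative codons per position."""
    for pos in range(0, len(dna), 3):
        for codon in CODON_TABLE:
            if codon != dna[pos:pos + 3]:
                yield pos // 3, dna[:pos] + codon + dna[pos + 3:]

def amino_acid_variants(dna):
    """Map each codon index to the set of substituted residues that are reachable."""
    parent = translate(dna)
    variants = {}
    for idx, variant in codon_site_saturation(dna):
        residue = translate(variant)[idx]
        if residue != "*" and residue != parent[idx]:
            variants.setdefault(idx, set()).add(residue)
    return variants

if __name__ == "__main__":
    parent_dna = "ATGGCTAAA"  # hypothetical parent encoding Met-Ala-Lys
    subs = amino_acid_variants(parent_dna)
    for idx in sorted(subs):
        print(idx, translate(parent_dna)[idx], len(subs[idx]))
```

For each codon position the enumeration produces 63 alternative codons, which collapse to the 19 amino acid substitutions (stop codons excluded), so that 20 residues including the original are represented at every position.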

Thus, in another aspect, this approach is also serviceable for generating mutants containing (in addition to &/or in combination with the 20 naturally encoded polypeptide-forming alpha-amino acids) other rare &/or not naturally-encoded amino acids and amino acid derivatives. In yet another aspect, this approach is also serviceable for generating mutants by the use of (in addition to &/or in combination with natural or unaltered codon recognition systems of suitable hosts) altered, mutagenized, &/or designer codon recognition systems (such as in a host cell with one or more altered tRNA molecules).

In yet another aspect, this invention relates to recombination and more specifically to a method for preparing polynucleotides encoding a polypeptide by a method of in vivo re-assortment of polynucleotide sequences containing regions of partial homology, assembling the polynucleotides to form at least one polynucleotide and screening the polynucleotides for the production of polypeptide(s) having a useful property.

In yet another preferred embodiment, this invention is serviceable for analyzing and cataloguing, with respect to any molecular property (e.g. an enzymatic activity) or combination of properties allowed by current technology, the effects of any mutational change achieved (including particularly saturation mutagenesis). Thus, a comprehensive method is provided for determining the effect of changing each amino acid in a parental polypeptide into each of at least 19 possible substitutions. This allows each amino acid in a parental polypeptide to be characterized and catalogued according to its spectrum of potential effects on a measurable property of the polypeptide.

In another aspect, the method of the present invention utilizes the natural property of cells to recombine molecules and/or to mediate reductive processes that reduce the complexity of sequences and extent of repeated or consecutive sequences possessing regions of homology.

It is an object of the present invention to provide a method for generating hybrid polynucleotides encoding biologically active hybrid polypeptides with enhanced activities. In accomplishing these and other objects, there has been provided, in accordance with one aspect of the invention, a method for introducing polynucleotides into a suitable host cell and growing the host cell under conditions that produce a hybrid polynucleotide.

In another aspect of the invention, the invention provides a method for screening for biologically active hybrid polypeptides encoded by hybrid polynucleotides. The present method allows for the identification of biologically active hybrid polypeptides with enhanced biological activities.

Other objects, features and advantages of the present invention will become apparent from the following detailed description. It should be understood, however, that the detailed description and the specific examples, while indicating preferred embodiments of the invention, are given by way of illustration only, since various changes and modifications within the spirit and scope of the invention will become apparent to those skilled in the art from this detailed description.

In yet another aspect, this invention relates to a method of discovering which phenotype corresponds to a gene by disrupting every gene in the organism.

Accordingly, this invention provides a method for determining a gene that alters a characteristic of an organism, comprising: a) obtaining an initial population of organisms, b) generating a set of mutagenized organisms, such that when all the genetic mutations in the set of mutagenized organisms are taken as a whole, there is represented a set of substantial genetic mutations, c) detecting the presence of an organism having an altered trait, and d) determining the nucleotide sequence of a gene that has been mutagenized in the organism having the altered trait.

In yet another aspect, this invention relates to a method of improving a trait in an organism by functionally knocking out a particular gene in the organism, and then transferring a library of genes, which only vary from the wild-type at one codon position, into the organism.

Accordingly, this invention provides a method for producing an organism with an improved trait, comprising:

    a) functionally knocking out an endogenous gene in a substantially clonal population of organisms;
    b) transferring a set of altered genes into the clonal population of organisms, wherein each altered gene differs from the endogenous gene at only one codon;
    c) detecting a mutagenized organism having an improved trait; and
    d) determining the nucleotide sequence of a gene that has been transferred into the detected organism.

D. BRIEF DESCRIPTION OF THE DRAWINGS

FIG. 1. Exonuclease Activity. FIG. 1 shows the activity of the enzyme exonuclease III. This is an exemplary enzyme that can be used to shuffle, assemble, reassemble, recombine, and/or concatenate polynucleotide building blocks. The asterisk indicates that the enzyme acts from the 3′ direction towards the 5′ direction of the polynucleotide substrate.

FIG. 2. Generation of A Nucleic Acid Building Block by Polymerase-Based Amplification. FIG. 2 illustrates a method of generating a double-stranded nucleic acid building block with two overhangs using a polymerase-based amplification reaction (e.g., PCR). As illustrated, a first polymerase-based amplification reaction using a first set of primers, F₂ and R₁, is used to generate a blunt-ended product (labeled Reaction 1, Product 1), which is essentially identical to Product A. A second polymerase-based amplification reaction using a second set of primers, F₁ and R₂, is used to generate a blunt-ended product (labeled Reaction 2, Product 2), which is essentially identical to Product B. These two products are then mixed and allowed to melt and anneal, generating a potentially useful double-stranded nucleic acid building block with two overhangs. In the example of FIG. 1, the product with the 3′ overhangs (Product C) is selected for by nuclease-based degradation of the other 3 products using a 3′ acting exonuclease, such as exonuclease III. Alternate primers are shown in parentheses to illustrate that serviceable primers may overlap, and additionally that serviceable primers may be of different lengths, as shown.

FIG. 3. Unique Overhangs And Unique Couplings. FIG. 3 illustrates the point that the number of unique overhangs of each size (e.g. the total number of unique overhangs composed of 1 or 2 or 3, etc. nucleotides) exceeds the number of unique couplings that can result from the use of all the unique overhangs of that size. For example, there are 4 unique 3′ overhangs composed of a single nucleotide, and 4 unique 5′ overhangs composed of a single nucleotide. Yet the total number of unique couplings that can be made using all the 8 unique single-nucleotide 3′ overhangs and single-nucleotide 5′ overhangs is 4.

FIG. 4. Unique Overall Assembly Order Achieved by Sequentially Coupling the Building Blocks. FIG. 4 illustrates the fact that in order to assemble a total of “n” nucleic acid building blocks, “n-1” couplings are needed. Yet it is sometimes the case that the number of unique couplings available for use is fewer than the “n-1” value. Under these, and other, circumstances a stringent non-stochastic overall assembly order can still be achieved by performing the assembly process in sequential steps. In this example, 2 sequential steps are used to achieve a designed overall assembly order for five nucleic acid building blocks. In this illustration the designed overall assembly order for the five nucleic acid building blocks is: 5′-(#1-#2-#3-#4-#5)-3′, where #1 represents building block number 1, etc.

FIG. 5. Unique Couplings Available Using a Two-Nucleotide 3′ Overhang. FIG. 5 further illustrates the point that the number of unique overhangs of each size (here, e.g. the total number of unique overhangs composed of 2 nucleotides) exceeds the number of unique couplings that can result from the use of all the unique overhangs of that size. For example, there are 16 unique 3′ overhangs composed of two nucleotides, and another 16 unique 5′ overhangs composed of two nucleotides, for a total of 32 as shown. Yet the total number of couplings that are unique and not self-binding that can be made using all the 32 unique double-nucleotide 3′ overhangs and double-nucleotide 5′ overhangs is 12. Some apparently unique couplings have “identical twins” (marked in the same shading), which are visually obvious in this illustration. Still other overhangs contain nucleotide sequences that can self-bind in a palindromic fashion, as shown and labeled in this figure; thus they do not contribute high stringency to the overall assembly order.
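The counts stated for FIG. 3 and FIG. 5 can be reproduced by a small enumeration. The sketch below rests on one modeling assumption that is not spelled out in the text: an overhang couples with the overhang of the same polarity whose sequence is its reverse complement, so that a coupling and its “identical twin” form one unordered pair, and a self-complementary overhang is palindromic. The function names are hypothetical.

```python
from itertools import product

COMPLEMENT = {"A": "T", "T": "A", "C": "G", "G": "C"}

def reverse_complement(seq):
    """Return the reverse complement of a DNA string."""
    return "".join(COMPLEMENT[b] for b in reversed(seq))

def count_couplings(k):
    """Count overhangs and unique, non-self-binding couplings for overhang length k."""
    overhangs = ["".join(p) for p in product("ACGT", repeat=k)]
    # Palindromic overhangs can self-bind and are excluded from the couplings.
    palindromic = [o for o in overhangs if o == reverse_complement(o)]
    # An overhang and its reverse complement form one coupling ("identical twins").
    pairs = {frozenset((o, reverse_complement(o)))
             for o in overhangs if o != reverse_complement(o)}
    return {
        "overhangs": 2 * len(overhangs),       # 3' overhangs plus 5' overhangs
        "palindromic_per_polarity": len(palindromic),
        "couplings": 2 * len(pairs),           # 3' couplings plus 5' couplings
    }

if __name__ == "__main__":
    for k in (1, 2):
        print(k, count_couplings(k))
```

Under this assumption, single-nucleotide overhangs give 8 overhangs and 4 couplings, and two-nucleotide overhangs give 32 overhangs, 4 palindromic sequences per polarity (AT, TA, CG, GC), and 12 couplings, in agreement with the figures' stated totals.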

FIG. 6. Generation of an Exhaustive Set of Chimeric Combinations by Synthetic Ligation Reassembly. FIG. 6 showcases the power of this invention in its ability to generate exhaustively and systematically all possible combinations of the nucleic acid building blocks designed in this example. Particularly large sets (or libraries) of progeny chimeric molecules can be generated. Because this method can be performed exhaustively and systematically, the method application can be repeated by choosing new demarcation points and with correspondingly newly designed nucleic acid building blocks, bypassing the burden of re-generating and re-screening previously examined and rejected molecular species. It is appreciated that codon wobble can be used to advantage to increase the frequency of a demarcation point. In other words, a particular base can often be substituted into a nucleic acid building block without altering the amino acid encoded by the progenitor codon (which is now an altered codon) because of codon degeneracy. As illustrated, demarcation points are chosen upon alignment of 8 progenitor templates. Nucleic acid building blocks including their overhangs (which are serviceable for the formation of ordered couplings) are then designed and synthesized. In this instance, 18 nucleic acid building blocks are generated based on the sequence of each of the 8 progenitor templates, for a total of 144 nucleic acid building blocks (or double-stranded oligos). Performing the ligation synthesis procedure will then produce a library of progeny molecules comprising a yield of 8¹⁸ (or over 1.8×10¹⁶) chimeras.

FIG. 7. Synthetic genes from oligos: According to one embodiment of this invention, double-stranded nucleic acid building blocks are designed by aligning a plurality of progenitor nucleic acid templates. Preferably these templates contain some homology and some heterology. The nucleic acids may encode related proteins, such as related enzymes, which relationship may be based on function or structure or both. FIG. 7 shows the alignment of three polynucleotide progenitor templates and the selection of demarcation points (boxed) shared by all the progenitor molecules. In this particular example, the nucleic acid building blocks derived from each of the progenitor templates were chosen to be approximately 30 to 50 nucleotides in length.

FIG. 8. Nucleic acid building blocks for synthetic ligation gene reassembly. FIG. 8 shows the nucleic acid building blocks from the example in FIG. 7. The nucleic acid building blocks are shown here in generic cartoon form, with their compatible overhangs, including both 5′ and 3′ overhangs. There are 22 total nucleic acid building blocks derived from each of the 3 progenitor templates. Thus, the ligation synthesis procedure can produce a library of progeny molecules comprising a yield of 3²² (or over 3.1×10¹⁰) chimeras.
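The library sizes quoted for FIG. 6 and FIG. 8 follow from simple counting: if each of the block positions can be filled independently by the corresponding block from any one of the progenitor templates, the number of chimeras is the number of templates raised to the number of block positions. The following sketch verifies the arithmetic; the function names are hypothetical.

```python
def building_block_count(n_templates, n_positions):
    """Total building blocks to synthesize: one per template per block position."""
    return n_templates * n_positions

def chimera_library_size(n_templates, n_positions):
    """Number of ordered chimeras when each position draws from any template."""
    return n_templates ** n_positions

if __name__ == "__main__":
    # FIG. 6 example: 8 progenitor templates, 18 building blocks each.
    print(building_block_count(8, 18), f"{chimera_library_size(8, 18):.2e}")
    # FIG. 8 example: 3 progenitor templates, 22 building blocks each.
    print(building_block_count(3, 22), f"{chimera_library_size(3, 22):.2e}")
```

The exact values are 8¹⁸ = 18,014,398,509,481,984 and 3²² = 31,381,059,609, consistent with the stated "over 1.8×10¹⁶" and "over 3.1×10¹⁰".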

FIG. 9. Addition of Introns by Synthetic Ligation Reassembly. FIG. 9 shows in generic cartoon form that an intron may be introduced into a chimeric progeny molecule by way of a nucleic acid building block. It is appreciated that introns often have consensus sequences at both termini in order to render them operational. It is also appreciated that, in addition to enabling gene splicing, introns may serve an additional purpose by providing sites of homology to other nucleic acids to enable homologous recombination. For this purpose, and potentially others, it may be sometimes desirable to generate a large nucleic acid building block for introducing an intron. If the size is too large to be easily generated by direct chemical synthesis of two single-stranded oligos, such a specialized nucleic acid building block may also be generated by direct chemical synthesis of more than two single-stranded oligos or by using a polymerase-based amplification reaction as shown in FIG. 2.

FIG. 10. Ligation Reassembly Using Fewer Than All The Nucleotides Of An Overhang. FIG. 10 shows that coupling can occur in a manner that does not make use of every nucleotide in a participating overhang. The coupling is particularly likely to survive (e.g. in a transformed host) if the coupling is reinforced by treatment with a ligase enzyme to form what may be referred to as a “gap ligation” or a “gapped ligation”. It is appreciated that, as shown, this type of coupling can contribute to generation of unwanted background product(s), but it can also be used advantageously to increase the diversity of the progeny library generated by the designed ligation reassembly.

FIG. 11. Avoidance of unwanted self-ligation in palindromic couplings. As mentioned before and shown in FIG. 5, certain overhangs are able to undergo self-coupling to form a palindromic coupling. A coupling is strengthened substantially if it is reinforced by treatment with a ligase enzyme. Accordingly, it is appreciated that the lack of 5′ phosphates on these overhangs, as shown, can be used advantageously to prevent this type of palindromic self-ligation. Accordingly, this invention provides that nucleic acid building blocks can be chemically made (or ordered) that lack a 5′ phosphate group (or alternatively the 5′ phosphates can be removed, e.g. by treatment with a phosphatase enzyme such as calf intestinal alkaline phosphatase (CIAP)) in order to prevent palindromic self-ligations in ligation reassembly processes.

FIG. 12. Pathway Engineering. It is a goal of this invention to provide ways of making new gene pathways using ligation reassembly, optionally with other directed evolution methods such as saturation mutagenesis. FIG. 12 illustrates a preferred approach that may be taken to achieve this goal. It is appreciated that naturally-occurring microbial gene pathways are linked more often than naturally-occurring eukaryotic (e.g. plant) gene pathways, which are sometimes only partially linked. In a particular embodiment, this invention provides that regulatory gene sequences (including promoters) can be introduced in the form of nucleic acid building blocks into progeny gene pathways generated by ligation reassembly processes. Thus, originally linked microbial gene pathways, as well as originally unlinked genes and gene pathways, can be converted to acquire operability in plants and other eukaryotes.

FIG. 13. Avoidance of unwanted self-ligation in palindromic couplings. FIG. 13 illustrates that another goal of this invention, in addition to the generation of novel gene pathways, is the subjection of gene pathways, both naturally occurring and man-made, to mutagenesis and selection in order to achieve improved progeny molecules using the instantly disclosed methods of directed evolution (including saturation mutagenesis and synthetic ligation reassembly). In a particular embodiment, as provided by the instant invention, both microbial and plant pathways can be improved by directed evolution, and as shown, the directed evolution process can be performed both on genes prior to linking them into pathways, and on gene pathways themselves.

FIG. 14. Conversion of Microbial Pathways to Eukaryotic Pathways. In a particular embodiment, this invention provides that microbial pathways can be converted to pathways operable in plants and other eukaryotic species by the introduction of regulatory sequences that function in those species. Preferred regulatory sequences include promoters, operators, and activator binding sites. As shown, a preferred method of achieving the introduction of such serviceable regulatory sequences is in the form of nucleic acid building blocks, particularly through the use of couplings in ligation reassembly processes. These couplings in FIG. 14 are marked with the letters A, B, C, D and F.

FIG. 15. Engineering of differentially activatable stacked traits in novel transgenic plants using directed evolution and holistic whole cell monitoring. It is a goal of this invention to provide ways of introducing differentially activatable stacked traits into a transgenic cell or organism, the effects of which are holistically monitored. FIG. 15 illustrates an approach that may be taken to introduce a plurality of stacked traits into an organism, such as but not limited to a plant, and to carry out holistic whole cell or organism monitoring. Holistic monitoring can include methods pertaining to genomics, RNA profiling, proteomics, metabolomics, and lipid profiling.

FIG. 16. Differential Activation of Selected Traits Can Be Achieved by Adjusting and Controlling the Environment of the Traits. In a particular embodiment, this invention provides that stacked traits can be introduced into an organism that are differentially activatable, allowing screening under various conditions. FIG. 16 illustrates an example in which the stacked traits comprise genetically introduced enzymes. In this example, the enzymes can be selectively and differentially activated by adjusting the environment to which they are exposed.

FIG. 17. Desired or improved traits for harvesting, processing, and storage conditions. One of the goals of this invention is to provide a method that allows the generation of recombinant proteins with desired or improved activities. In a particular embodiment, as illustrated in this figure, a potential application of this method is screening transgenic cells for various responses to harvesting, processing, and storage conditions of biological reagents and strains. Differentially activatable stacked traits have been introduced into the transgenic cells. Screening methods that pertain to methods of genomics, proteomics, RNA profiling, metabolomics, and lipid profiling can be utilized and assessed under various specific conditions that include but are not limited to variations in pH, temperature, and other environmental conditions.

FIG. 18. Mutagenesis and production of a transgenic organism. In another embodiment, this invention provides a general method to introduce a library of mutagenized nucleotide sequences (e.g., generated by saturation mutagenesis and/or ligation reassembly) into an organism, and to screen the transgenic organisms for various holistic phenotypes (preferably using a high throughput method). Optionally, mutations can be combined and the organisms rescreened and/or a second library can be introduced into the transgenic organisms and the process repeated. In a preferred embodiment, the starting population is comprised of an organism strain to be subjected to improvement or evolution in order to produce a resultant population comprised of an improved organism strain that has a desired trait.

FIG. 19. Gene Product Processing. FIG. 19 illustrates that various processing or decorating steps occur to a gene product prior to it being active. This is a schematic of various processing steps that render a product active or inactive. Once a gene product is active it can be differentially expressed and in certain cases modifications in its activities or properties can be screened.

FIG. 20. Differential Activation of Selected Precursor (Inactive) Gene Products. FIG. 20 is a schematic that illustrates post-translational modifications as a potential process that differentially activates gene products. Differential activation of gene products should be considered when designing screening assays. In screening assays, a transgenic organism may not be selected if the gene product has been inactivated due to post-translational effects such as proteolytic cleavage.

FIG. 21. Production of an improved organism or strain that has a desired trait. In another embodiment, this invention provides a general method to introduce a library of mutagenized nucleotide sequences into an organism, and to screen the transgenic organisms or strain for various phenotypes (preferably using a high throughput method). Screening methods that pertain to methods of genomics, proteomics, RNA profiling, metabolomics, and lipid profiling can be utilized to identify a subset of desired mutants, such as “up-mutants”. Optionally, mutations can be combined and the organisms rescreened and/or a second library can be introduced into the transgenic organisms and the process repeated. In a preferred embodiment, the starting population is comprised of an organism strain to be subjected to improvement or evolution in order to produce a resultant population comprised of an improved organism strain that has a desired trait.

FIG. 22. Reassortment of polynucleotide sequences to produce an improved sequence that has a desired trait. Another goal of this invention is to provide a method to prepare mutagenized polynucleotides, to screen the polynucleotide products, and thereby produce an improved sequence with a desired trait. For example, as illustrated in FIG. 22, mutagenized polynucleotides can be generated by in vivo based reassortment methods such as transposon-based or homologous recombination-based methods. Subsequently, the transgenic organisms can be screened to select a desirable subset of mutants (such as those with an enhanced trait or “up mutant”). The subset of organisms can be selected and various mutations can be combined. The resultant strain can undergo further rounds of selection for an “up mutant” and/or the improved genomic sequence can be selected and determined.

FIG. 23. Strain Improvement. FIG. 23 further illustrates the utility of this invention for the generation of improved strains or organisms. This schematic illustratively compares classical and modified classical genetic methods with a method provided in this invention. This invention provides for the generation of strains that harbor more mutations than are typically harbored by strains generated by classical genetic approaches. The generation of strains with numerous mutations and subsequent screening of such strains will allow for the selection of improved strains. As illustrated in this figure, an embodiment of this invention is to generate random clones (e.g., that are a result of three levels of mutagenesis), create transgenic organisms upon the transfer of these clones in a high throughput process, allow in vivo recombination due to homologous recombination, transposon insertion, or suicide plasmids, and identify strains with improved characteristics by screening. Subsequently, the clones that rendered improved characteristics could be identified and combined into one strain with the goal of generating an improved strain due to multiple genetic mutations.

FIG. 24. Iterative Strain Improvement. This figure illustrates how this invention provides a method for iterative strain improvement by allowing multiple rounds of mutagenesis, recombination, and selection. In this schematic, a library from an organism is subjected to mutagenesis and then transformed into a parent organism. Once in the cell, additional variation is introduced by in vivo recombination (e.g., homologous recombination). Resultant strains are screened for a desired or enhanced trait (an “up mutant”) and the mutations are identified and sequenced. Subsequently, various sets or subsets of identified clones can be recombined to create further strain improvements.

FIG. 25. Illustrative diagram for the introduction of mutations for genome site saturation mutagenesis. In one sense, this method permits the targeted construction of markerless deletions, insertions, and point mutations in a genome (such as a bacterial chromosome) for genome site saturation mutagenesis. Libraries of genomes can be mutagenized (and multiply mutagenized) and introduced into cells, allowing recombination with genomic alleles. For example, as illustrated in this diagram, a suicide plasmid that carries a mutant allele and the recognition site of the yeast meganuclease I-SceI can be inserted into a genome by homologous recombination between the mutant and the wild-type alleles. Further recombination results in either a mutant or a wild-type chromosome. Pools of mutants generated from the same genome fragment can be combined and stored in one position of an array such that every fragment of the genome can be mutated to saturation.

FIG. 26. Producing polynucleotides via interrupted synthesis methods. An embodiment of this invention provides for the production of chimeric/mutagenized polynucleotides (including coding and noncoding regions) generated by incomplete extension. Incomplete extension can be used to generate intermediate products of varying length that ultimately may be utilized to generate pools of chimeric/mutagenized polynucleotides. Various methods can be utilized to interrupt synthesis of nucleic acids: abbreviated annealing times (as exemplified in FIG. 27), decreased dNTP concentrations, multiple monobinders priming one polybinder template, template chemistry (such as using a template with chemically modified bases), a DNA polymerase with decreased activity, and/or the use of modified nucleotides during synthesis (such as ddCTP).

FIG. 27. Utilizing PCR cycles with abbreviated annealing times for interrupted synthesis. An embodiment of this invention provides for the production of chimeric/mutagenized polynucleotides (including coding and noncoding regions) generated by interrupted synthesis methods. The use of variations of standard PCR cycles with abbreviated annealing times is one method that can lead to incomplete extension. As illustrated, there are numerous possible variations (such as, but not limited to, variations 1-5) that could be utilized.

FIG. 28. Example of a flow chart that is serviceable for performing computer-aided analysis according to this invention.

E. DEFINITIONS OF TERMS

In order to facilitate understanding of the examples provided herein, certain frequently occurring methods and/or terms will be described.

The term “agent” is used herein to denote a chemical compound, a mixture of chemical compounds, an array of spatially localized compounds (e.g., a VLSIPS peptide array, polynucleotide array, and/or combinatorial small molecule array), biological macromolecule, a bacteriophage peptide display library, a bacteriophage antibody (e.g., scFv) display library, a polysome peptide display library, or an extract made from biological materials such as bacteria, plants, fungi, or animal (particularly mammalian) cells or tissues. Agents are evaluated for potential activity as anti-neoplastics, anti-inflammatories or apoptosis modulators by inclusion in screening assays described hereinbelow. Agents are evaluated for potential activity as specific protein interaction inhibitors (i.e., an agent which selectively inhibits a binding interaction between two predetermined polypeptides but which does not substantially interfere with cell viability) by inclusion in screening assays described hereinbelow.

An “ambiguous base requirement” in a restriction site refers to a nucleotide base requirement that is not specified to the fullest extent, i.e. that is not a specific base (such as, in a non-limiting exemplification, a specific base selected from A, C, G, and T), but rather may be any one of at least two or more bases. Commonly accepted abbreviations that are used in the art as well as herein to represent ambiguity in bases include the following: R=G or A; Y=C or T; M=A or C; K=G or T; S=G or C; W=A or T; H=A or C or T; B=G or T or C; V=G or C or A; D=G or A or T; N=A or C or G or T.
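As a minimal illustrative sketch (not part of the disclosed method), the abbreviations listed above can be written as a lookup table and used to expand an ambiguous site into the specific sequences it permits. The function names `expand_site` and `matches` are hypothetical, and the site `GTYRAC` is used only as an example input.

```python
from itertools import product

# Ambiguity codes exactly as listed in the definition above,
# plus the four fully specified bases.
AMBIGUITY = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "R": "GA", "Y": "CT", "M": "AC", "K": "GT", "S": "GC", "W": "AT",
    "H": "ACT", "B": "GTC", "V": "GCA", "D": "GAT", "N": "ACGT",
}

def expand_site(site):
    """Return every fully specified sequence permitted by an ambiguous site."""
    choices = [AMBIGUITY[symbol] for symbol in site.upper()]
    return ["".join(combo) for combo in product(*choices)]

def matches(site, sequence):
    """Return True if `sequence` satisfies each base requirement of `site`."""
    if len(site) != len(sequence):
        return False
    return all(base in AMBIGUITY[symbol]
               for symbol, base in zip(site.upper(), sequence.upper()))

if __name__ == "__main__":
    # Y = C or T and R = G or A, so this site permits 2 x 2 = 4 sequences.
    print(expand_site("GTYRAC"))
```

A site containing only A, C, G, and T expands to itself, whereas each ambiguous symbol multiplies the number of permitted sequences by the number of bases it stands for.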

The term “amino acid” as used herein refers to any organic compound that contains an amino group (-NH₂) and a carboxyl group (-COOH); preferably either as free groups or alternatively after condensation as part of peptide bonds. The “twenty naturally encoded polypeptide-forming alpha-amino acids” are understood in the art and refer to: alanine (ala or A), arginine (arg or R), asparagine (asn or N), aspartic acid (asp or D), cysteine (cys or C), glutamic acid (glu or E), glutamine (gln or Q), glycine (gly or G), histidine (his or H), isoleucine (ile or I), leucine (leu or L), lysine (lys or K), methionine (met or M), phenylalanine (phe or F), proline (pro or P), serine (ser or S), threonine (thr or T), tryptophan (trp or W), tyrosine (tyr or Y), and valine (val or V).

The term “amplification” means that the number of copies of a polynucleotide is increased.

The term “antibody”, as used herein, refers to intact immunoglobulin molecules, as well as fragments of immunoglobulin molecules, such as Fab, Fab′, (Fab′)₂, Fv, and SCA fragments, that are capable of binding to an epitope of an antigen. These antibody fragments, which retain some ability to selectively bind to an antigen (e.g., a polypeptide antigen) of the antibody from which they are derived, can be made using well known methods in the art (see, e.g., Harlow and Lane, supra), and are described further, as follows.

-   -   (1) An Fab fragment consists of a monovalent antigen-binding fragment of an antibody molecule, and can be produced by digestion of a whole antibody molecule with the enzyme papain, to yield a fragment consisting of an intact light chain and a portion of a heavy chain.
    -   (2) An Fab′ fragment of an antibody molecule can be obtained by treating a whole antibody molecule with pepsin, followed by reduction, to yield a molecule consisting of an intact light chain and a portion of a heavy chain. Two Fab′ fragments are obtained per antibody molecule treated in this manner.
    -   (3) An (Fab′)₂ fragment of an antibody can be obtained by treating a whole antibody molecule with the enzyme pepsin, without subsequent reduction. A (Fab′)₂ fragment is a dimer of two Fab′ fragments, held together by two disulfide bonds.
    -   (4) An Fv fragment is defined as a genetically engineered fragment containing the variable region of a light chain and the variable region of a heavy chain expressed as two chains.
    -   (5) A single chain antibody (“SCA”) is a genetically engineered single chain molecule containing the variable region of a light chain and the variable region of a heavy chain, linked by a suitable, flexible polypeptide linker.

The term “Applied Molecular Evolution” (“AME”) means the application of an evolutionary design algorithm to a specific, useful goal. While many different library formats for AME have been reported for polynucleotides, peptides and proteins (phage, lacI and polysomes), none of these formats have provided for recombination by random cross-overs to deliberately create a combinatorial library.

A molecule that has a “chimeric property” is a molecule that is: 1) in part homologous and in part heterologous to a first reference molecule; while 2) at the same time being in part homologous and in part heterologous to a second reference molecule; without 3) precluding the possibility of being at the same time in part homologous and in part heterologous to still one or more additional reference molecules. In a non-limiting embodiment, a chimeric molecule may be prepared by assembling a reassortment of partial molecular sequences. In a non-limiting aspect, a chimeric polynucleotide molecule may be prepared by synthesizing the chimeric polynucleotide using a plurality of molecular templates, such that the resultant chimeric polynucleotide has properties of a plurality of templates.

The term “cognate” as used herein refers to a gene sequence that is evolutionarily and functionally related between species. For example, but not limitation, in the human genome the human CD4 gene is the cognate gene to the mouse 3d4 gene, since the sequences and structures of these two genes indicate that they are highly homologous and both genes encode a protein which functions in signaling T cell activation through MHC class II-restricted antigen recognition.

A “comparison window,” as used herein, refers to a conceptual segment of at least 20 contiguous nucleotide positions wherein a polynucleotide sequence may be compared to a reference sequence of at least 20 contiguous nucleotides and wherein the portion of the polynucleotide sequence in the comparison window may comprise additions or deletions (i.e., gaps) of 20 percent or less as compared to the reference sequence (which does not comprise additions or deletions) for optimal alignment of the two sequences. Optimal alignment of sequences for aligning a comparison window may be conducted by the local homology algorithm of Smith (Smith and Waterman, Adv Appl Math, 1981; Smith and Waterman, J Theor Biol, 1981; Smith and Waterman, J Mol Biol, 1981; Smith et al, J Mol Evol, 1981), by the homology alignment algorithm of Needleman (Needleman and Wunsch, 1970), by the search of similarity method of Pearson (Pearson and Lipman, 1988), by computerized implementations of these algorithms (GAP, BESTFIT, FASTA, and TFASTA in the Wisconsin Genetics Software Package Release 7.0, Genetics Computer Group, 575 Science Dr., Madison, Wis.), or by inspection, and the best alignment (i.e., resulting in the highest percentage of homology over the comparison window) generated by the various methods is selected.

As used herein, the terms “complementarity-determining region” and “CDR” refer to the art-recognized term as exemplified by the Kabat and Chothia CDR definitions, also generally known as supervariable regions or hypervariable loops (Chothia and Lesk, 1987; Chothia et al, 1989; Kabat et al, 1987; and Tramontano et al, 1990). Variable region domains typically comprise the amino-terminal approximately 105-115 amino acids of a naturally-occurring immunoglobulin chain (e.g., amino acids 1-110), although variable domains somewhat shorter or longer are also suitable for forming single-chain antibodies.

“Conservative amino acid substitutions” refer to the interchangeability of residues having similar side chains. For example, a group of amino acids having aliphatic side chains is glycine, alanine, valine, leucine, and isoleucine; a group of amino acids having aliphatic-hydroxyl side chains is serine and threonine; a group of amino acids having amide-containing side chains is asparagine and glutamine; a group of amino acids having aromatic side chains is phenylalanine, tyrosine, and tryptophan; a group of amino acids having basic side chains is lysine, arginine, and histidine; and a group of amino acids having sulfur-containing side chains is cysteine and methionine. Preferred conservative amino acid substitution groups are: valine-leucine-isoleucine, phenylalanine-tyrosine, lysine-arginine, alanine-valine, and asparagine-glutamine.
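The side-chain groups named above can be expressed as sets of one-letter codes, which gives a simple membership test for whether a substitution is conservative in the sense of this definition. This is a hedged sketch only; the names `SIDE_CHAIN_GROUPS` and `is_conservative` are hypothetical and do not appear in the specification.

```python
# Side-chain groups from the definition above, in one-letter code.
SIDE_CHAIN_GROUPS = {
    "aliphatic": set("GAVLI"),
    "aliphatic-hydroxyl": set("ST"),
    "amide-containing": set("NQ"),
    "aromatic": set("FYW"),
    "basic": set("KRH"),
    "sulfur-containing": set("CM"),
}

def is_conservative(original, replacement):
    """Return True if two different residues share one of the groups above."""
    if original == replacement:
        return False  # no substitution has taken place
    return any(original in group and replacement in group
               for group in SIDE_CHAIN_GROUPS.values())

if __name__ == "__main__":
    print(is_conservative("L", "I"))  # both aliphatic
    print(is_conservative("K", "D"))  # basic versus acidic, no shared group
```

Residues that fall in none of the listed groups (for example aspartic acid, glutamic acid, and proline) are never reported as conservative by this sketch, which reflects only the groups the text enumerates.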

The term “corresponds to” is used herein to mean that a polynucleotide sequence is homologous (i.e., is identical, not strictly evolutionarily related) to all or a portion of a reference polynucleotide sequence, or that a polypeptide sequence is identical to a reference polypeptide sequence. In contradistinction, the term “complementary to” is used herein to mean that the complementary sequence is homologous to all or a portion of a reference polynucleotide sequence. For illustration, the nucleotide sequence “TATAC” corresponds to a reference “TATAC” and is complementary to a reference sequence “GTATA.”
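The “TATAC” illustration can be reproduced with a short sketch. It assumes that “all or a portion of” a reference can be modeled as an exact substring match, and that the complementary sequence is read as the reverse complement; the function names are hypothetical.

```python
# Translation table mapping each DNA base to its Watson-Crick partner.
COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the complementary strand, written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]

def corresponds_to(seq, reference):
    """True if `seq` is identical to all or a portion of `reference`."""
    return seq.upper() in reference.upper()

def complementary_to(seq, reference):
    """True if the complement of `seq` is identical to all or a portion of `reference`."""
    return reverse_complement(seq) in reference.upper()

if __name__ == "__main__":
    print(corresponds_to("TATAC", "TATAC"))    # True
    print(complementary_to("TATAC", "GTATA"))  # True
```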

The term “degrading effective” amount refers to the amount of enzyme which is required to process at least 50% of the substrate, as compared to substrate not contacted with the enzyme. Preferably, at least 80% of the substrate is degraded.

As used herein, the term “defined sequence framework” refers to a set of defined sequences that are selected on a non-random basis, generally on the basis of experimental data or structural data; for example, a defined sequence framework may comprise a set of amino acid sequences that are predicted to form a β-sheet structure or may comprise a leucine zipper heptad repeat motif, a zinc-finger domain, among other variations. A “defined sequence kernel” is a set of sequences which encompass a limited scope of variability. Whereas (1) a completely random 10-mer sequence of the 20 conventional amino acids can be any of (20)¹⁰ sequences, and (2) a pseudorandom 10-mer sequence of the 20 conventional amino acids can be any of (20)¹⁰ sequences but will exhibit a bias for certain residues at certain positions and/or overall, (3) a defined sequence kernel is a subset of the sequences that would be possible if each residue position were allowed to be any of the allowable 20 conventional amino acids (and/or allowable unconventional amino/imino acids). A defined sequence kernel generally comprises variant and invariant residue positions and/or comprises variant residue positions which can comprise a residue selected from a defined subset of amino acid residues, and the like, either segmentally or over the entire length of the individual selected library member sequence. Defined sequence kernels can refer to either amino acid sequences or polynucleotide sequences. For illustration and not limitation, the sequences (NNK)₁₀ and (NNM)₁₀, wherein N represents A, T, G, or C; K represents G or T; and M represents A or C, are defined sequence kernels.
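The sequence counts discussed above can be checked with a few lines of arithmetic. The sketch below is illustrative only: `kernel_size` is a hypothetical helper that multiplies the number of bases allowed at each position of a degenerate polynucleotide pattern.

```python
# Number of bases permitted by each symbol used in the text
# (N = A, T, G, or C; K = G or T; M = A or C).
DEGENERACY = {"A": 1, "C": 1, "G": 1, "T": 1, "N": 4, "K": 2, "M": 2}

def kernel_size(pattern):
    """Count the distinct nucleotide sequences a degenerate pattern allows."""
    size = 1
    for symbol in pattern:
        size *= DEGENERACY[symbol]
    return size

# A completely random 10-mer over the 20 conventional amino acids.
random_10mer_peptides = 20 ** 10

# (NNK)10 and (NNM)10 each allow 4 * 4 * 2 = 32 triplets per position.
nnk_kernel = kernel_size("NNK" * 10)
nnm_kernel = kernel_size("NNM" * 10)

if __name__ == "__main__":
    print(random_10mer_peptides)  # 10240000000000
    print(nnk_kernel)             # 1125899906842624
    print(nnm_kernel)             # 1125899906842624
```

Both kernels are small fractions of the 64¹⁰ nucleotide sequences that fully random 30-mers would span, which is the sense in which a kernel has a limited scope of variability.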

“Digestion” of DNA refers to catalytic cleavage of the DNA with a restriction enzyme that acts only at certain sequences in the DNA. The various restriction enzymes used herein are commercially available and their reaction conditions, cofactors and other requirements were used as would be known to the ordinarily skilled artisan. For analytical purposes, typically 1 μg of plasmid or DNA fragment is used with about 2 units of enzyme in about 20 μl of buffer solution. For the purpose of isolating DNA fragments for plasmid construction, typically 5 to 50 μg of DNA are digested with 20 to 250 units of enzyme in a larger volume. Appropriate buffers and substrate amounts for particular restriction enzymes are specified by the manufacturer. Incubation times of about 1 hour at 37° C. are ordinarily used, but may vary in accordance with the supplier's instructions. After digestion the reaction is electrophoresed directly on a gel to isolate the desired fragment.

“Directional ligation” refers to a ligation in which a 5′ end and a 3′ end of a polynucleotide are different enough to specify a preferred ligation orientation. For example, an otherwise untreated and undigested PCR product that has two blunt ends will typically not have a preferred ligation orientation when ligated into a cloning vector digested to produce blunt ends in its multiple cloning site; thus, directional ligation will typically not be displayed under these circumstances. In contrast, directional ligation will typically be displayed when a digested PCR product having a 5′ EcoR I-treated end and a 3′ BamH I-treated end is ligated into a cloning vector that has a multiple cloning site digested with EcoR I and BamH I.

The term “DNA shuffling” is used herein to indicate recombination between substantially homologous but non-identical sequences; in some embodiments DNA shuffling may involve crossover via non-homologous recombination, such as via cre/lox and/or flp/frt systems and the like.

As used in this invention, the term “epitope” refers to an antigenic determinant on an antigen, such as a phytase polypeptide, to which the paratope of an antibody, such as a phytase-specific antibody, binds. Antigenic determinants usually consist of chemically active surface groupings of molecules, such as amino acids or sugar side chains, and can have specific three-dimensional structural characteristics, as well as specific charge characteristics. As used herein “epitope” refers to that portion of an antigen or other macromolecule capable of forming a binding interaction that interacts with the variable region binding body of an antibody. Typically, such binding interaction is manifested as an intermolecular contact with one or more amino acid residues of a CDR.

The terms “fragment”, “derivative” and “analog” when referring to a reference polypeptide comprise a polypeptide which retains at least one biological function or activity that is at least essentially the same as that of the reference polypeptide. Furthermore, the terms “fragment”, “derivative” or “analog” are exemplified by a “pro-form” molecule, such as a low activity proprotein that can be modified by cleavage to produce a mature enzyme with significantly higher activity.

A method is provided herein for producing from a template polypeptide a set of progeny polypeptides in which a “full range of single amino acid substitutions” is represented at each amino acid position. As used herein, “full range of single amino acid substitutions” is in reference to the 20 naturally encoded polypeptide-forming alpha-amino acids, as described herein.
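To make the size of such a progeny set concrete, the following sketch enumerates, purely in silico, every single amino acid substitution of a short template. It is not the laboratory method itself; the function name, the mutation label format (for example `M1A`), and the three-residue template are assumptions chosen for illustration.

```python
# One-letter codes of the 20 naturally encoded amino acids.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def single_substitutions(template):
    """Yield (label, sequence) for each single-residue substitution.

    Each position is replaced in turn by each of the 19 amino acids that
    differ from the template residue at that position.
    """
    for index, original in enumerate(template):
        for replacement in AMINO_ACIDS:
            if replacement != original:
                label = f"{original}{index + 1}{replacement}"
                yield label, template[:index] + replacement + template[index + 1:]

if __name__ == "__main__":
    progeny = list(single_substitutions("MKT"))
    print(len(progeny))  # 3 positions x 19 substitutions = 57
    print(progeny[0])    # ('M1A', 'AKT')
```

For a template of length n the set therefore holds 19n variants, or 20 residues represented at each position once the template residue itself is counted.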

The term “gene” means the segment of DNA involved in producing a polypeptide chain; it includes regions preceding and following the coding region (leader and trailer) as well as intervening sequences (introns) between individual coding segments (exons).

“Genetic instability”, as used herein, refers to the natural tendency of highly repetitive sequences to be lost through a process of reductive events generally involving sequence simplification through the loss of repeated sequences. Deletions tend to involve the loss of one copy of a repeat and everything between the repeats.

The term “heterologous” means that one single-stranded nucleic acid sequence is unable to hybridize to another single-stranded nucleic acid sequence or its complement. Thus, “areas of heterology” means that polynucleotides, or areas of polynucleotides, have areas or regions within their sequence which are unable to hybridize to another nucleic acid or polynucleotide. Such regions or areas are, for example, areas of mutations.

The term “homologous” or “homeologous” means that one single-stranded nucleic acid sequence may hybridize to a complementary single-stranded nucleic acid sequence. The degree of hybridization may depend on a number of factors including the amount of identity between the sequences and the hybridization conditions such as temperature and salt concentrations as discussed later. Preferably the region of identity is greater than about 5 bp, more preferably the region of identity is greater than 10 bp.

An immunoglobulin light or heavy chain variable region consists of a “framework” region interrupted by three hypervariable regions, also called CDR's. The extent of the framework region and CDR's have been precisely defined; see “Sequences of Proteins of Immunological Interest” (Kabat et al, 1987). The sequences of the framework regions of different light or heavy chains are relatively conserved within a species. As used herein, a “human framework region” is a framework region that is substantially identical (about 85% or more, usually 90-95% or more) to the framework region of a naturally occurring human immunoglobulin. The framework region of an antibody, that is the combined framework regions of the constituent light and heavy chains, serves to position and align the CDR's. The CDR's are primarily responsible for binding to an epitope of an antigen.

The benefits of this invention extend to “commercial applications” (or commercial processes), which term is used to include applications in commercial industry proper (or simply industry) as well as non-commercial commercial applications (e.g. biomedical research at a non-profit institution). Relevant applications include those in areas of diagnosis, medicine, agriculture, manufacturing, and academia.

The term “identical” or “identity” means that two nucleic acid sequences have the same sequence or a complementary sequence. Thus, “areas of identity” means that regions or areas of a polynucleotide or the overall polynucleotide are identical or complementary to areas of another polynucleotide or the polynucleotide.

The term “isolated” means that the material is removed from its original environment (e.g., the natural environment if it is naturally occurring). For example, a naturally-occurring polynucleotide or enzyme present in a living animal is not isolated, but the same polynucleotide or enzyme, separated from some or all of the coexisting materials in the natural system, is isolated. Such polynucleotides could be part of a vector and/or such polynucleotides or enzymes could be part of a composition, and still be isolated in that such vector or composition is not part of its natural environment.

By “isolated nucleic acid” is meant a nucleic acid, e.g., a DNA or RNA molecule, that is not immediately contiguous with the 5′ and 3′ flanking sequences with which it normally is immediately contiguous when present in the naturally occurring genome of the organism from which it is derived. The term thus describes, for example, a nucleic acid that is incorporated into a vector, such as a plasmid or viral vector; a nucleic acid that is incorporated into the genome of a heterologous cell (or the genome of a homologous cell, but at a site different from that at which it naturally occurs); and a nucleic acid that exists as a separate molecule, e.g., a DNA fragment produced by PCR amplification or restriction enzyme digestion, or an RNA molecule produced by in vitro transcription. The term also describes a recombinant nucleic acid that forms part of a hybrid gene encoding additional polypeptide sequences that can be used, for example, in the production of a fusion protein.

As used herein “ligand” refers to a molecule, such as a random peptide or variable segment sequence, that is recognized by a particular receptor. As one of skill in the art will recognize, a molecule (or macromolecular complex) can be both a receptor and a ligand. In general, the binding partner having a smaller molecular weight is referred to as the ligand and the binding partner having a greater molecular weight is referred to as a receptor.

“Ligation” refers to the process of forming phosphodiester bonds between two double stranded nucleic acid fragments (Sambrook et al, 1982, p. 146; Sambrook, 1989). Unless otherwise provided, ligation may be accomplished using known buffers and conditions with 10 units of T4 DNA ligase (“ligase”) per 0.5 μg of approximately equimolar amounts of the DNA fragments to be ligated.

As used herein, “linker” or “spacer” refers to a molecule or group of molecules that connects two molecules, such as a DNA binding protein and a random peptide, and serves to place the two molecules in a preferred configuration, e.g., so that the random peptide can bind to a receptor with minimal steric hindrance from the DNA binding protein.

As used herein, a “molecular property to be evolved” includes reference to molecules comprised of a polynucleotide sequence, molecules comprised of a polypeptide sequence, and molecules comprised in part of a polynucleotide sequence and in part of a polypeptide sequence. Particularly relevant, but by no means limiting, examples of molecular properties to be evolved include enzymatic activities at specified conditions, such as related to temperature; salinity; pressure; pH; and concentration of glycerol, DMSO, detergent, &/or any other molecular species with which contact is made in a reaction environment. Additional particularly relevant, but by no means limiting, examples of molecular properties to be evolved include stabilities, e.g. the amount of a residual molecular property that is present after a specified exposure time to a specified environment, such as may be encountered during storage.

The term “mutations” includes changes in the sequence of a wild-type or parental nucleic acid sequence or changes in the sequence of a peptide. Such mutations may be point mutations such as transitions or transversions. The mutations may be deletions, insertions or duplications. A mutation can also be a “chimerization”, which is exemplified in a progeny molecule that is generated to contain part or all of a sequence of one parental molecule as well as part or all of a sequence of at least one other parental molecule. This invention provides for both chimeric polynucleotides and chimeric polypeptides.

As used herein, the degenerate “N,N,G/T” nucleotide sequence represents 32 possible triplets, where “N” can be A, C, G or T.
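The count of 32 follows from 4 × 4 × 2 choices at the three positions. As a minimal illustrative sketch, not part of the original disclosure, the following Python enumerates the triplets and translates them with the standard genetic code. The names `nnk_codons` and `translate` are assumptions introduced here for illustration only.

```python
from itertools import product

# Standard genetic code, with bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def nnk_codons():
    """Enumerate every triplet matching N,N,G/T (N = A, C, G or T)."""
    return ["".join(t) for t in product("ACGT", "ACGT", "GT")]


def translate(codon):
    """Return the one-letter amino acid for a codon ('*' marks a stop)."""
    return CODON_TABLE[codon]


codons = nnk_codons()
encoded = {translate(c) for c in codons}
stops = [c for c in codons if translate(c) == "*"]

print(len(codons))               # number of degenerate triplets
print(sorted(encoded - {"*"}))   # amino acids reachable from N,N,G/T
print(stops)                     # stop codons retained by the degeneracy
```

Under the standard genetic code, the 32 triplets cover all 20 amino acids and retain only one of the three stop codons (TAG).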

The term “naturally-occurring” as used herein as applied to an object refers to the fact that an object can be found in nature. For example, a polypeptide or polynucleotide sequence that is present in an organism (including viruses) that can be isolated from a source in nature and which has not been intentionally modified by man in the laboratory is naturally occurring. Generally, the term naturally occurring refers to an object as present in a non-pathological (un-diseased) individual, such as would be typical for the species.

As used herein, a “nucleic acid molecule” is comprised of at least onebase or one base pair, depending on whether it is single-stranded ordouble-stranded, respectively. Furthermore, a nucleic acid molecule maybelong exclusively or chimerically to any group of nucleotide-containingmolecules, as exemplified by, but not limited to, the following groupsof nucleic acid molecules: RNA, DNA, genomic nucleic acids, non-genomicnucleic acids, naturally occurring and not naturally occurring nucleicacids, and synthetic nucleic acids. This includes, by way ofnon-limiting example, nucleic acids associated with any organelle, suchas the mitochondria, ribosomal RNA, and nucleic acid molecules comprisedchimerically of one or more components that are not naturally occurringalong with naturally occurring components.

Additionally, a “nucleic acid molecule” may contain in part one or more non-nucleotide-based components as exemplified by, but not limited to, amino acids and sugars. Thus, by way of example, but not limitation, a ribozyme that is in part nucleotide-based and in part protein-based is considered a “nucleic acid molecule”.

In addition, by way of example, but not limitation, a nucleic acid molecule that is labeled with a detectable moiety, such as a radioactive or alternatively a non-radioactive label, is likewise considered a “nucleic acid molecule”.

The terms “nucleic acid sequence coding for” or a “DNA coding sequence of” or a “nucleotide sequence encoding” a particular enzyme, as well as other synonymous terms, refer to a DNA sequence which is transcribed and translated into an enzyme when placed under the control of appropriate regulatory sequences. A “promoter sequence” is a DNA regulatory region capable of binding RNA polymerase in a cell and initiating transcription of a downstream (3′ direction) coding sequence. The promoter is part of the DNA sequence. This sequence region has a start codon at its 3′ terminus. The promoter sequence includes the minimum number of bases or elements necessary to initiate transcription at levels detectable above background. However, after the RNA polymerase binds the sequence and transcription is initiated at the start codon (3′ terminus with a promoter), transcription proceeds downstream in the 3′ direction. Within the promoter sequence will be found a transcription initiation site (conveniently defined by mapping with nuclease S1) as well as protein binding domains (consensus sequences) responsible for the binding of RNA polymerase.

The terms “nucleic acid encoding an enzyme (protein)” or “DNA encoding an enzyme (protein)” or “polynucleotide encoding an enzyme (protein)” and other synonymous terms encompass a polynucleotide which includes only coding sequence for the enzyme as well as a polynucleotide which includes additional coding and/or non-coding sequence.

In one preferred embodiment, a “specific nucleic acid molecule species” is defined by its chemical structure, as exemplified by, but not limited to, its primary sequence. In another preferred embodiment, a “specific nucleic acid molecule species” is defined by a function of the nucleic acid species or by a function of a product derived from the nucleic acid species. Thus, by way of non-limiting example, a “specific nucleic acid molecule species” may be defined by one or more activities or properties attributable to it, including activities or properties attributable to its expressed product.

The instant definition of “assembling a working nucleic acid sample intoa nucleic acid library” includes the process of incorporating a nucleicacid sample into a vector-based collection, such as by ligation into avector and transformation of a host. A description of relevant vectors,hosts, and other reagents as well as specific non-limiting examplesthereof are provided hereinafter. The instant definition of “assemblinga working nucleic acid sample into a nucleic acid library” also includesthe process of incorporating a nucleic acid sample into anon-vector-based collection, such as by ligation to adaptors. Preferablythe adaptors can anneal to PCR primers to facilitate amplification byPCR.

Accordingly, in a non-limiting embodiment, a “nucleic acid library” is comprised of a vector-based collection of one or more nucleic acid molecules. In another preferred embodiment a “nucleic acid library” is comprised of a non-vector-based collection of nucleic acid molecules. In yet another preferred embodiment a “nucleic acid library” is comprised of a combined collection of nucleic acid molecules that is in part vector-based and in part non-vector-based. Preferably, the collection of molecules comprising a library is searchable and separable according to individual nucleic acid molecule species.

The present invention provides a “nucleic acid construct” or alternatively a “nucleotide construct” or alternatively a “DNA construct”. The term “construct” is used herein to describe a molecule, such as a polynucleotide (e.g., a phytase polynucleotide), that may optionally be chemically bonded to one or more additional molecular moieties, such as a vector, or parts of a vector. In a specific, but by no means limiting, aspect, a nucleotide construct is exemplified by DNA expression constructs suitable for the transformation of a host cell.

An “oligonucleotide” (or synonymously an “oligo”) refers to either a single stranded polydeoxynucleotide or two complementary polydeoxynucleotide strands which may be chemically synthesized. Such synthetic oligonucleotides may or may not have a 5′ phosphate. Those that do not will not ligate to another oligonucleotide without adding a phosphate with an ATP in the presence of a kinase. A synthetic oligonucleotide will ligate to a fragment that has not been dephosphorylated. To achieve polymerase-based amplification (such as with PCR), a “32-fold degenerate oligonucleotide that is comprised of, in series, at least a first homologous sequence, a degenerate N,N,G/T sequence, and a second homologous sequence” is mentioned. As used in this context, “homologous” is in reference to homology between the oligo and the parental polynucleotide that is subjected to the polymerase-based amplification.

As used herein, the term “operably linked” refers to a linkage of polynucleotide elements in a functional relationship. A nucleic acid is “operably linked” when it is placed into a functional relationship with another nucleic acid sequence. For instance, a promoter or enhancer is operably linked to a coding sequence if it affects the transcription of the coding sequence. Operably linked means that the DNA sequences being linked are typically contiguous and, where necessary to join two protein coding regions, contiguous and in reading frame.

A coding sequence is “operably linked to” another coding sequence when RNA polymerase will transcribe the two coding sequences into a single mRNA, which is then translated into a single polypeptide having amino acids derived from both coding sequences. The coding sequences need not be contiguous to one another so long as the expressed sequences are ultimately processed to produce the desired protein.

As used herein the term “parental polynucleotide set” is a set comprised of one or more distinct polynucleotide species. Usually this term is used in reference to a progeny polynucleotide set which is preferably obtained by mutagenization of the parental set, in which case the terms “parental”, “starting” and “template” are used interchangeably.

As used herein the term “physiological conditions” refers totemperature, pH, ionic strength, viscosity, and like biochemicalparameters which are compatible with a viable organism, and/or whichtypically exist intracellularly in a viable cultured yeast cell ormammalian cell. For example, the intracellular conditions in a yeastcell grown under typical laboratory culture conditions are physiologicalconditions. Suitable in vitro reaction conditions for in vitrotranscription cocktails are generally physiological conditions. Ingeneral, in vitro physiological conditions comprise 50-200 mM NaCl orKCl, pH 6.5-8.5, 20-45 C and 0.001-10 mM divalent cation (e.g., Mg⁺⁺,Ca⁺⁺); preferably about 150 mM NaCl or KCl, pH 7.2-7.6, 5 mM divalentcation, and often include 0.01-1.0 percent nonspecific protein (e.g.,BSA). A non-ionic detergent (Tween, NP-40, Triton X-100) can often bepresent, usually at about 0.001 to 2%, typically 0.05-0.2% (v/v).Particular aqueous conditions may be selected by the practitioneraccording to conventional methods. For general guidance, the followingbuffered aqueous conditions may be applicable: 10-250 mM NaCl, 5-50 mMTris HCl, pH 5-8, with optional addition of divalent cation(s) and/ormetal chelators and/or non-ionic detergents and/or membrane fractionsand/or anti-foam agents and/or scintillants.

Standard convention (5′ to 3′) is used herein to describe the sequence of double-stranded polynucleotides.

The term “population” as used herein means a collection of components such as polynucleotides, portions of polynucleotides or proteins. A “mixed population” means a collection of components which belong to the same family of nucleic acids or proteins (i.e., are related) but which differ in their sequence (i.e., are not identical) and hence in their biological activity.

A molecule having a “pro-form” refers to a molecule that undergoes any combination of one or more covalent and noncovalent chemical modifications (e.g. glycosylation, proteolytic cleavage, dimerization or oligomerization, temperature-induced or pH-induced conformational change, association with a co-factor, etc.) en route to attain a more mature molecular form having a property difference (e.g. an increase in activity) in comparison with the reference pro-form molecule. When two or more chemical modifications (e.g. two proteolytic cleavages, or a proteolytic cleavage and a deglycosylation) can be distinguished en route to the production of a mature molecule, the reference precursor molecule may be termed a “pre-pro-form” molecule.

As used herein, the term “pseudorandom” refers to a set of sequences that have limited variability, such that, for example, the degree of residue variability at one position is different from the degree of residue variability at another position, but any pseudorandom position is allowed some degree of residue variation, however circumscribed.

“Quasi-repeated units”, as used herein, refers to the repeats to be re-assorted and are by definition not identical. Indeed the method is proposed not only for practically identical encoding units produced by mutagenesis of the identical starting sequence, but also the reassortment of similar or related sequences which may diverge significantly in some regions. Nevertheless, if the sequences contain sufficient homologies to be reassorted by this approach, they can be referred to as “quasi-repeated” units.

As used herein “random peptide library” refers to a set of polynucleotide sequences that encodes a set of random peptides, and to the set of random peptides encoded by those polynucleotide sequences, as well as the fusion proteins containing those random peptides.

As used herein, “random peptide sequence” refers to an amino acid sequence composed of two or more amino acid monomers and constructed by a stochastic or random process. A random peptide can include framework or scaffolding motifs, which may comprise invariant sequences.

As used herein, “receptor” refers to a molecule that has an affinity for a given ligand. Receptors can be naturally occurring or synthetic molecules. Receptors can be employed in an unaltered state or as aggregates with other species. Receptors can be attached, covalently or non-covalently, to a binding member, either directly or via a specific binding substance. Examples of receptors include, but are not limited to, antibodies, including monoclonal antibodies and antisera reactive with specific antigenic determinants (such as on viruses, cells, or other materials), cell membrane receptors, complex carbohydrates and glycoproteins, enzymes, and hormone receptors.

“Recombinant” enzymes refer to enzymes produced by recombinant DNA techniques, i.e., produced from cells transformed by an exogenous DNA construct encoding the desired enzyme. “Synthetic” enzymes are those prepared by chemical synthesis.

The term “related polynucleotides” means that regions or areas of the polynucleotides are identical and regions or areas of the polynucleotides are heterologous.

“Reductive reassortment”, as used herein, refers to the increase in molecular diversity that is accrued through deletion (and/or insertion) events that are mediated by repeated sequences.

The following terms are used to describe the sequence relationships between two or more polynucleotides: “reference sequence,” “comparison window,” “sequence identity,” “percentage of sequence identity,” and “substantial identity.”

A “reference sequence” is a defined sequence used as a basis for asequence comparison; a reference sequence may be a subset of a largersequence, for example, as a segment of a full-length cDNA or genesequence given in a sequence listing, or may comprise a complete cDNA orgene sequence. Generally, a reference sequence is at least 20nucleotides in length, frequently at least 25 nucleotides in length, andoften at least 50 nucleotides in length. Since two polynucleotides mayeach (1) comprise a sequence (i.e., a portion of the completepolynucleotide sequence) that is similar between the two polynucleotidesand (2) may further comprise a sequence that is divergent between thetwo polynucleotides, sequence comparisons between two (or more)polynucleotides are typically performed by comparing sequences of thetwo polynucleotides over a “comparison window” to identify and comparelocal regions of sequence similarity.

“Repetitive Index (RI)”, as used herein, is the average number of copies of the quasi-repeated units contained in the cloning vector.

The term “restriction site” refers to a recognition sequence that is necessary for the manifestation of the action of a restriction enzyme, and includes a site of catalytic cleavage. It is appreciated that a site of cleavage may or may not be contained within a portion of a restriction site that comprises a low ambiguity sequence (i.e. a sequence containing the principal determinant of the frequency of occurrence of the restriction site). Thus, in many cases, relevant restriction sites contain only a low ambiguity sequence with an internal cleavage site (e.g. G/AATTC in the EcoR I site) or an immediately adjacent cleavage site (e.g. /CCWGG in the EcoR II site). In other cases, relevant restriction sites [e.g. the Eco57 I site or CTGAAG(16/14)] contain a low ambiguity sequence (e.g. the CTGAAG sequence in the Eco57 I site) with an external cleavage site (e.g. in the N₁₆ portion of the Eco57 I site). When an enzyme (e.g. a restriction enzyme) is said to “cleave” a polynucleotide, it is understood to mean that the restriction enzyme catalyzes or facilitates a cleavage of a polynucleotide.

In a non-limiting aspect, a “selectable polynucleotide” is comprised ofa 5′ terminal region (or end region), an intermediate region (i.e. aninternal or central region), and a 3′ terminal region (or end region).As used in this aspect, a 5′ terminal region is a region that is locatedtowards a 5′ polynucleotide terminus (or a 5′ polynucleotide end); thusit is either partially or entirely in a 5′ half of a polynucleotide.Likewise, a 3′ terminal region is a region that is located towards a 3′polynucleotide terminus (or a 3′ polynucleotide end); thus it is eitherpartially or entirely in a 3′ half of a polynucleotide. As used in thisnon-limiting exemplification, there may be sequence overlap between anytwo regions or even among all three regions.

The term “sequence identity” means that two polynucleotide sequences are identical (i.e., on a nucleotide-by-nucleotide basis) over the window of comparison. The term “percentage of sequence identity” is calculated by comparing two optimally aligned sequences over the window of comparison, determining the number of positions at which the identical nucleic acid base (e.g., A, T, C, G, U, or I) occurs in both sequences to yield the number of matched positions, dividing the number of matched positions by the total number of positions in the window of comparison (i.e., the window size), and multiplying the result by 100 to yield the percentage of sequence identity. The term “substantial identity”, as used herein, denotes a characteristic of a polynucleotide sequence, wherein the polynucleotide comprises a sequence having at least 80 percent sequence identity, preferably at least 85 percent identity, often 90 to 95 percent sequence identity, and most commonly at least 99 percent sequence identity as compared to a reference sequence over a comparison window of at least 25-50 nucleotides, wherein the percentage of sequence identity is calculated by comparing the reference sequence to the polynucleotide sequence which may include deletions or additions which total 20 percent or less of the reference sequence over the window of comparison.
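The percentage calculation described above can be sketched in a few lines of Python. This is an illustrative sketch only: it assumes the two sequences have already been optimally aligned to equal length with “-” marking gaps, it does not itself enforce the 25-50 nucleotide minimum window, and the function names `percent_identity` and `is_substantially_identical` are hypothetical.

```python
def percent_identity(aligned_a, aligned_b):
    """Percentage of sequence identity over a comparison window.

    Both inputs are assumed to be pre-aligned, equal-length strings in which
    '-' marks a deletion or addition. A position counts as matched only when
    both sequences carry the same base there.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    window_size = len(aligned_a)
    if window_size == 0:
        raise ValueError("comparison window is empty")
    matched = sum(
        1
        for x, y in zip(aligned_a.upper(), aligned_b.upper())
        if x == y and x != "-"
    )
    # matched positions / window size, multiplied by 100
    return 100.0 * matched / window_size


def is_substantially_identical(aligned_a, aligned_b, threshold=80.0):
    """Apply the lowest threshold named in the definition (80 percent)."""
    return percent_identity(aligned_a, aligned_b) >= threshold


reference = "ACGTACGTAC"
candidate = "ACGTTCGTAC"   # one substitution relative to the reference
print(percent_identity(reference, candidate))
print(is_substantially_identical(reference, candidate))
```

The ten-base window here is only for readability; a real comparison would use a window of at least 25 nucleotides and an alignment produced by a dedicated tool.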

As known in the art, “similarity” between two enzymes is determined by comparing the amino acid sequence and its conserved amino acid substitutes of one enzyme to the sequence of a second enzyme. Similarity may be determined by procedures which are well-known in the art, for example, a BLAST program (Basic Local Alignment Search Tool at the National Center for Biotechnology Information).

As used herein, the term “single-chain antibody” refers to a polypeptide comprising a V_(H) domain and a V_(L) domain in polypeptide linkage, generally linked via a spacer peptide (e.g., [Gly-Gly-Gly-Gly-Ser]_(x)), and which may comprise additional amino acid sequences at the amino- and/or carboxy-termini. For example, a single-chain antibody may comprise a tether segment for linking to the encoding polynucleotide. As an example, a scFv is a single-chain antibody. Single-chain antibodies are generally proteins consisting of one or more polypeptide segments of at least 10 contiguous amino acids substantially encoded by genes of the immunoglobulin superfamily (e.g., see Williams and Barclay, 1989, pp. 361-368, which is incorporated herein by reference), most frequently encoded by a rodent, non-human primate, avian, porcine, bovine, ovine, goat, or human heavy chain or light chain gene sequence. A functional single-chain antibody generally contains a sufficient portion of an immunoglobulin superfamily gene product so as to retain the property of binding to a specific target molecule, typically a receptor or antigen (epitope).

The members of a pair of molecules (e.g., an antibody-antigen pair or a nucleic acid pair) are said to “specifically bind” to each other if they bind to each other with greater affinity than to other, non-specific molecules. For example, an antibody raised against an antigen to which it binds more efficiently than to a non-specific protein can be described as specifically binding to the antigen. (Similarly, a nucleic acid probe can be described as specifically binding to a nucleic acid target if it forms a specific duplex with the target by base pairing interactions (see above).)

“Specific hybridization” is defined herein as the formation of hybrids between a first polynucleotide and a second polynucleotide (e.g., a polynucleotide having a distinct but substantially identical sequence to the first polynucleotide), wherein substantially unrelated polynucleotide sequences do not form hybrids in the mixture.

The term “specific polynucleotide” means a polynucleotide having certain end points and having a certain nucleic acid sequence. Two polynucleotides wherein one polynucleotide has the identical sequence as a portion of the second polynucleotide but different ends comprise two different specific polynucleotides.

“Stringent hybridization conditions” means hybridization will occur only if there is at least 90% identity, preferably at least 95% identity and most preferably at least 97% identity between the sequences. See Sambrook et al, 1989, which is hereby incorporated by reference in its entirety.

Also included in the invention are polypeptides having sequences that are “substantially identical” to the sequence of a phytase polypeptide, such as one of SEQ ID 1. A “substantially identical” amino acid sequence is a sequence that differs from a reference sequence only by conservative amino acid substitutions, for example, substitutions of one amino acid for another of the same class (e.g., substitution of one hydrophobic amino acid, such as isoleucine, valine, leucine, or methionine, for another, or substitution of one polar amino acid for another, such as substitution of arginine for lysine, glutamic acid for aspartic acid, or glutamine for asparagine).

Additionally a “substantially identical” amino acid sequence is a sequence that differs from a reference sequence by one or more non-conservative substitutions, deletions, or insertions, particularly when such a substitution occurs at a site that is not the active site of the molecule, and provided that the polypeptide essentially retains its behavioural properties. For example, one or more amino acids can be deleted from a phytase polypeptide, resulting in modification of the structure of the polypeptide, without significantly altering its biological activity. For example, amino- or carboxyl-terminal amino acids that are not required for phytase biological activity can be removed. Such modifications can result in the development of smaller active phytase polypeptides.

The present invention provides a “substantially pure enzyme”. The term“substantially pure enzyme” is used herein to describe a molecule, suchas a polypeptide (e.g., a phytase polypeptide, or a fragment thereof)that is substantially free of other proteins, lipids, carbohydrates,nucleic acids, and other biological materials with which it is naturallyassociated. For example, a substantially pure molecule, such as apolypeptide, can be at least 60%, by dry weight, the molecule ofinterest. The purity of the polypeptides can be determined usingstandard methods including, e.g., polyacrylamide gel electrophoresis(e.g., SDS-PAGE), column chromatography (e.g., high performance liquidchromatography (HPLC)), and amino-terminal amino acid sequence analysis.

As used herein, “substantially pure” means an object species is the predominant species present (i.e., on a molar basis it is more abundant than any other individual macromolecular species in the composition), and preferably a substantially purified fraction is a composition wherein the object species comprises at least about 50 percent (on a molar basis) of all macromolecular species present. Generally, a substantially pure composition will comprise more than about 80 to 90 percent of all macromolecular species present in the composition. Most preferably, the object species is purified to essential homogeneity (contaminant species cannot be detected in the composition by conventional detection methods) wherein the composition consists essentially of a single macromolecular species. Solvent species, small molecules (<500 Daltons), and elemental ion species are not considered macromolecular species.
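As a rough illustration of the molar-basis test just described, the following Python sketch computes the molar fraction of each macromolecular species and checks the “predominant” and “at least about 50 percent” criteria. It is a simplification made for illustration: solvent, small molecules and elemental ions are all excluded by a single 500 Dalton cutoff, and the function names, the input layout and the sample composition are hypothetical.

```python
MACROMOLECULE_CUTOFF_DA = 500.0  # species below this are not macromolecular


def macromolecular_fractions(composition):
    """Return the molar fraction of each macromolecular species.

    `composition` maps a species name to (moles, molecular_weight_in_daltons).
    Species lighter than the cutoff (solvent, small molecules, ions) are
    ignored, following the definition in the text.
    """
    macro = {
        name: moles
        for name, (moles, mw) in composition.items()
        if mw >= MACROMOLECULE_CUTOFF_DA
    }
    total = sum(macro.values())
    if total == 0:
        return {}
    return {name: moles / total for name, moles in macro.items()}


def is_substantially_pure(composition, target, threshold=0.5):
    """True if `target` is predominant and meets the molar threshold."""
    fractions = macromolecular_fractions(composition)
    if target not in fractions:
        return False
    predominant = all(
        fractions[target] > f for name, f in fractions.items() if name != target
    )
    return predominant and fractions[target] >= threshold


# Hypothetical sample: amounts in arbitrary molar units.
sample = {
    "enzyme": (8.0, 45000.0),
    "contaminant": (2.0, 30000.0),
    "water": (1000.0, 18.0),
    "Mg2+": (5.0, 24.3),
}
print(macromolecular_fractions(sample))
print(is_substantially_pure(sample, "enzyme"))
print(is_substantially_pure(sample, "enzyme", threshold=0.9))
```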

As used herein, the term “variable segment” refers to a portion of a nascent peptide which comprises a random, pseudorandom, or defined kernel sequence. A variable segment can comprise both variant and invariant residue positions, and the degree of residue variation at a variant residue position may be limited: both options are selected at the discretion of the practitioner. Typically, variable segments are about 5 to 20 amino acid residues in length (e.g., 8 to 10), although variable segments may be longer and may comprise antibody portions or receptor proteins, such as an antibody fragment, a nucleic acid binding protein, a receptor protein, and the like.

The term “wild-type” means that the polynucleotide does not comprise any mutations. A “wild type” protein means that the protein will be active at a level of activity found in nature and will comprise the amino acid sequence found in nature.

The term “working”, as in “working sample”, for example, is simply a sample with which one is working. Likewise, a “working molecule”, for example, is a molecule with which one is working.

1. SCREENING AND SELECTION

1.1. Overview of Screening and Selection

Screening is, in general, a two-step process in which one first determines which cells do and do not express a screening marker and then physically separates the cells having the desired property. Screening markers include, for example, luciferase, beta-galactosidase, and green fluorescent protein. Screening can also be done by observing a cell holistically, including but not limited to utilizing methods pertaining to genomics, RNA profiling, proteomics, metabolomics, and lipidomics, as well as observing such aspects of growth as colony size, halo formation, etc. Additionally, screening for production of a desired compound, such as a therapeutic drug or “designer chemical”, can be accomplished by observing binding of cell products to a receptor or ligand, such as on a solid support or on a column. Such screening can additionally be accomplished by binding to antibodies, as in an ELISA. In some instances the screening process is preferably automated so as to allow screening of suitable numbers of colonies or cells. Some examples of automated screening devices include fluorescence activated cell sorting (FACS), especially in conjunction with cells immobilized in agarose (see Powell et al. Bio/Technology 8: 333-337 (1990); Weaver et al. Methods 2: 234-247 (1991)), automated ELISA assays, scintillation proximity assays (Hart, H. E. et al., Molecular Immunol. 16: 265-267 (1979)) and the formation of fluorescent, colored or UV absorbing compounds on agar plates or in microtitre wells (Krawiec, S., Devel. Indust. Microbiology 31: 103-114 (1990)).

Selection is a form of screening in which identification and physicalseparation are achieved simultaneously, for example, by expression of aselectable marker, which, in some genetic circumstances, allows cellsexpressing the marker to survive while other cells die (or vice versa).Selectable markers can include, for example, drug, toxin resistance, ornutrient synthesis genes. Selection is also done by such techniques asgrowth on a toxic substrate to select for hosts having the ability todetoxify a substrate, growth on a new nutrient source to select forhosts having the ability to utilize that nutrient source, competitivegrowth in culture based on ability to utilize a nutrient source, etc.

In particular, uncloned but differentially expressed proteins (e.g., those induced in response to new compounds, such as biodegradable pollutants in the medium) can be screened by differential display (Appleyard et al. Mol. Gen. Genet. 247: 338-342 (1995)). Hopwood (Phil Trans R. Soc. Lond B 324: 549-562) provides a review of screens for antibiotic production. Omura (Microbio. Rev. 50: 259-279 (1986)) and Nisbet (Ann Rev. Med. Chem. 21: 149-157 (1986)) disclose screens for antimicrobial agents, including supersensitive bacteria, detection of beta-lactamase and D,D-carboxypeptidase inhibition, beta-lactamase induction, chromogenic substrates and monoclonal antibody screens.

Antibiotic targets can also be used as screening targets in highthroughput screening. Antifungals are typically screened by inhibitionof fungal growth. Pharmacological agents can be identified as enzymeinhibitors using plates containing the enzyme and a chromogenicsubstrate, or by automated receptor assays. Hydrolytic enzymes (e.g.,proteases, amylases) can be screened by including the substrate in anagar plate and scoring for a hydrolytic clear zone or by using acolorimetric indicator (Steele et al. Ann. Rev. Microbiol. 45: 89-106(1991)). This can be coupled with the use of stains to detect theeffects of enzyme action (such as congo red to detect the extent ofdegradation of celluloses and hemicelluloses).

Tagged substrates can also be used. For example, lipases and esterases can be screened using different lengths of fatty acids linked to umbelliferyl. The action of lipases or esterases removes this tag from the fatty acid, resulting in a quenching or enhancement of umbelliferyl fluorescence. These enzymes can be screened in microtiter plates by a robotic device.

1.2. High-Throughput Cellular Screening: Utilizing Various Types of “Omics”

Functional genomics seeks to discover gene function once nucleotidesequence information is available. Proteomics (the study of proteinproperties such as expression, post-translational modifications,interactions, etc.) and metabolomics (analysis of metabolite pools) arefast-emerging fields complementing functional genomics, that provide aglobal, integrated view of cellular processes. The variety of techniquesand methods used in this effort include the use of bioinformatics,gene-array chips, mRNA differential display, disease models, proteindiscovery and expression, and target validation. The ultimate goal ofmany of these efforts has been to develop high-throughput screens forgenes of unknown function. For review see Greenbaum D. et al. GenomeRes, 11(9): 1463-8 (2001).

1.2.1 Genomics

An embodiment of this invention provides for cellular screening; in aparticular embodiment, cellular screening may include genomics. “Highthroughput genomics” refers to application of genomic or genetic data oranalysis techniques that use microarrays or other genomic technologiesto rapidly identify large numbers of genes or proteins, or distinguishtheir structure, expression or function from normal or abnormal cells ortissues. An observer can be a person viewing a slide with a microscopeor an observer who views digital images. Alternatively, an observer canbe a computer-based image analysis system, which automatically observes,analyses and quantitates biological arrayed samples with or without userinteraction. Genomics can refer to various investigative techniques thatare broad in scope but often refers to measuring gene expression formultitudes of genes simultaneously. For a review see Lockhart, D. J. andWinzeler, E. A. 2000. Genomics, gene expression and DNA arrays. Nature,405(6788): 827-36.

1.2.1.1. Biological Chips

1.2.1.1.1. General Considerations

In one aspect the present invention provides for the use of arrays of oligonucleotide probes immobilized in microfabricated patterns on silica chips for analyzing molecular interactions of biological interest. In some assay formats, the oligonucleotide probe is tethered, i.e., by covalent attachment, to a solid support, and arrays of oligonucleotide probes immobilized on solid supports have been used to detect specific nucleic acid sequences in a target nucleic acid. See, e.g., PCT patent publication Nos. WO 89/10977 and 89/11548. Others have proposed the use of large numbers of oligonucleotide probes to provide the complete nucleic acid sequence of a target nucleic acid but failed to provide an enabling method for using arrays of immobilized probes for this purpose. See U.S. Pat. Nos. 5,202,231 and 5,002,867 and PCT patent publication No. WO 93/17126. See U.S. Pat. No. 5,143,854 and PCT patent publication Nos. WO 90/15070 and 92/10092, each of which is incorporated herein by reference. Microfabricated arrays of large numbers of oligonucleotide probes, called “DNA chips”, offer great promise for a wide variety of applications. New methods and reagents are required to realize this promise, and the present invention helps meet that need.

1.2.1.1.2. General Strategies for Utilizing Nucleic Acid Arrays

The invention provides several strategies employing immobilized arrays of probes for comparing a reference sequence of known sequence with a target sequence showing substantial similarity with the reference sequence, but differing in the presence of, e.g., mutations. In a first embodiment, the invention provides a tiling strategy employing an array of immobilized oligonucleotide probes comprising at least two sets of probes. A first probe set comprises a plurality of probes, each probe comprising a segment of at least three nucleotides exactly complementary to a subsequence of the reference sequence, the segment including at least one interrogation position complementary to a corresponding nucleotide in the reference sequence. A second probe set comprises a corresponding probe for each probe in the first probe set, the corresponding probe in the second probe set being identical to a sequence comprising the corresponding probe from the first probe set or a subsequence of at least three nucleotides thereof that includes the at least one interrogation position, except that the at least one interrogation position is occupied by a different nucleotide in each of the two corresponding probes from the first and second probe sets. The probes in the first probe set have at least two interrogation positions corresponding to two contiguous nucleotides in the reference sequence. One interrogation position corresponds to one of the contiguous nucleotides, and the other interrogation position to the other.

In a second embodiment, the invention provides a tiling strategy employing an array comprising four probe sets. A first probe set comprises a plurality of probes, each probe comprising a segment of at least three nucleotides exactly complementary to a subsequence of the reference sequence, the segment including at least one interrogation position complementary to a corresponding nucleotide in the reference sequence. Second, third and fourth probe sets each comprise a corresponding probe for each probe in the first probe set.

The probes in the second, third and fourth probe sets are identical to a sequence comprising the corresponding probe from the first probe set or a subsequence of at least three nucleotides thereof that includes the at least one interrogation position, except that the at least one interrogation position is occupied by a different nucleotide in each of the four corresponding probes from the four probe sets. The first probe set often has at least 100 interrogation positions corresponding to 100 contiguous nucleotides in the reference sequence. Sometimes the first probe set has an interrogation position corresponding to every nucleotide in the reference sequence. The segment of complementarity within the probe set is usually about 9-21 nucleotides. Although probes may contain leading or trailing sequences in addition to the 9-21 nucleotides, many probes consist exclusively of a 9-21 nucleotide segment of complementarity.

In a third embodiment, the invention provides immobilized arrays of probes tiled for multiple reference sequences. One such array comprises at least one pair of first and second probe groups, each group comprising first and second sets of probes as defined in the first embodiment. Each probe in the first probe set from the first group is exactly complementary to a subsequence of a first reference sequence, and each probe in the first probe set from the second group is exactly complementary to a subsequence of a second reference sequence.

Thus, the first group of probes is tiled with respect to a first reference sequence and the second group of probes with respect to a second reference sequence. Each group of probes can also include third and fourth sets of probes as defined in the second embodiment. In some arrays of this type, the second reference sequence is a mutated form of the first reference sequence.

In a fourth embodiment, the invention provides arrays for block tiling. Block tiling is a species of the general tiling strategies described above. The usual unit of a block tiling array is a group of probes comprising a wildtype probe, a first set of three mutant probes and a second set of three mutant probes. The wildtype probe comprises a segment of at least three nucleotides exactly complementary to a subsequence of a reference sequence. The segment has at least first and second interrogation positions corresponding to first and second nucleotides in the reference sequence. The probes in the first set of three mutant probes are each identical to a sequence comprising the wildtype probe or a subsequence of at least three nucleotides thereof including the first and second interrogation positions, except in the first interrogation position, which is occupied by a different nucleotide in each of the three mutant probes and the wildtype probe. The probes in the second set of three mutant probes are each identical to a sequence comprising the wildtype probe or a subsequence of at least three nucleotides thereof including the first and second interrogation positions, except in the second interrogation position, which is occupied by a different nucleotide in each of the three mutant probes and the wildtype probe.
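As a minimal sketch of the block unit just described, the following Python fragment builds one wildtype probe and one set of three mutant probes per interrogation position. The function name block_tile, the example window and the convention of writing each probe so that it aligns position-by-position with the reference window are assumptions made for illustration only; they are not taken from the specification.

```python
# Illustrative sketch only; names and the alignment convention are assumptions.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def block_tile(reference_window, interrogation_positions):
    """Return (wildtype_probe, mutant_sets) for one block tiling unit.

    The wildtype probe is the base-by-base complement of the window.
    For each interrogation position, three mutant probes are produced
    that differ from the wildtype probe only at that position.
    """
    wildtype = "".join(COMPLEMENT[b] for b in reference_window)
    mutant_sets = []
    for pos in interrogation_positions:
        mutants = [
            wildtype[:pos] + base + wildtype[pos + 1:]
            for base in "ACGT"
            if base != wildtype[pos]  # skip the wildtype nucleotide
        ]
        mutant_sets.append(mutants)
    return wildtype, mutant_sets

# Two interrogation positions give 1 wildtype + 2 x 3 mutant probes = 7 probes.
wildtype, mutant_sets = block_tile("GATTACAGG", [3, 4])
```

With five interrogation positions the same helper yields one wildtype probe and five sets of three mutant probes, i.e., sixteen probes per block.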

In a fifth embodiment, the invention provides methods of comparing a target sequence with a reference sequence using arrays of immobilized pooled probes. The arrays employed in these methods represent a further species of the general tiling arrays noted above. In these methods, variants of a reference sequence differing from the reference sequence in at least one nucleotide are identified and each is assigned a designation. An array of pooled probes is provided, with each pool occupying a separate cell of the array. Each pool comprises a probe comprising a segment exactly complementary to each variant sequence assigned a particular designation.

The array is then contacted with a target sequence comprising a variant of the reference sequence. The relative hybridization intensities of the pools in the array to the target sequence are determined. The identity of the target sequence is deduced from the pattern of hybridization intensities. Often, each variant is assigned a designation having at least one digit and at least one value for the digit. In this case, each pool comprises a probe comprising a segment exactly complementary to each variant sequence assigned a particular value in a particular digit. When variants are assigned successive numbers in a numbering system of base m having n digits, n×(m−1) pooled probes are used to assign each variant a designation.
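The n×(m−1) count can be illustrated with a short Python sketch. It assumes, as one reading of the count above, that a pool exists for every digit position and every non-zero digit value, so that a zero digit is inferred from the absence of signal in that digit's pools. The names build_pools and decode are hypothetical.

```python
# Illustrative sketch only; the "no pool for value 0" reading is an assumption.
def build_pools(num_variants, base, digits):
    """Map (digit, value) -> list of variant numbers placed in that pool."""
    if num_variants > base ** digits:
        raise ValueError("not enough digits to number every variant")
    pools = {}
    for digit in range(digits):
        for value in range(1, base):  # value 0 gets no pool
            pools[(digit, value)] = [
                v for v in range(num_variants)
                if (v // base ** digit) % base == value
            ]
    return pools

def decode(lit_pools, base):
    """Recover the variant number from the pools that hybridized."""
    return sum(value * base ** digit for digit, value in lit_pools)

# Sixteen variants numbered in base 2 with 4 digits need 4 x (2 - 1) = 4 pools.
pools = build_pools(num_variants=16, base=2, digits=4)
lit = {key for key, members in pools.items() if 11 in members}
```

Here variant number 11 (binary 1011) appears in the pools for digits 0, 1 and 3, and decode(lit, 2) returns 11. Under this reading, variant number 0 hybridizes to no pool at all, which a practical design would have to account for.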

In a sixth embodiment, the invention provides a pooled probe for trellis tiling, a further species of the general tiling strategy. In trellis tiling, the identity of a nucleotide in a target sequence is determined from a comparison of hybridization intensities of three pooled trellis probes. A pooled trellis probe comprises a segment exactly complementary to a subsequence of a reference sequence except at a first interrogation position occupied by a pooled nucleotide N, a second interrogation position occupied by a pooled nucleotide selected from the group of three consisting of (1) M or K, (2) R or Y and (3) S or W, and a third interrogation position occupied by a second pooled nucleotide selected from the group. The pooled nucleotide occupying the second interrogation position comprises a nucleotide complementary to a corresponding nucleotide from the reference sequence when the second pooled probe and reference sequence are maximally aligned, and the pooled nucleotide occupying the third interrogation position comprises a nucleotide complementary to a corresponding nucleotide from the reference sequence when the third pooled probe and the reference sequence are maximally aligned. Standard IUPAC nomenclature is used for describing pooled nucleotides.

In trellis tiling, an array comprises at least first, second and third cells, respectively occupied by first, second and third pooled probes, each according to the generic description above. However, the segment of complementarity, location of interrogation positions, and selection of pooled nucleotide at each interrogation position may or may not differ between the three pooled probes, subject to the following constraint. One of the three interrogation positions in each of the three pooled probes must align with the same corresponding nucleotide in the reference sequence.

This interrogation position must be occupied by an N in one of the pooled probes, and a different pooled nucleotide in each of the other two pooled probes.
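The constraint can be made concrete with a small Python sketch. It uses the standard IUPAC ambiguity codes (N = any base, M = A/C, K = G/T, R = A/G, Y = C/T, S = C/G, W = A/T) and tabulates, for each probe-side base, whether it is contained in the N pool and in two further pooled nucleotides. The function name signature_table is hypothetical, and treating pool membership as a yes/no hybridization signal is a simplifying assumption.

```python
# Illustrative sketch only; real signals are graded intensities, not booleans.
IUPAC = {
    "N": set("ACGT"),
    "M": set("AC"), "K": set("GT"),
    "R": set("AG"), "Y": set("CT"),
    "S": set("CG"), "W": set("AT"),
}

def signature_table(code_a, code_b):
    """Map (in N, in code_a, in code_b) -> base for the shared position.

    If the two pooled nucleotides come from different pairs of the group
    (M/K, R/Y, S/W), every base receives a distinct signature.
    """
    table = {}
    for base in "ACGT":
        signature = (
            base in IUPAC["N"],
            base in IUPAC[code_a],
            base in IUPAC[code_b],
        )
        table[signature] = base
    return table

distinct = signature_table("M", "R")   # four distinct signatures
collapsed = signature_table("M", "K")  # same pair: only two signatures
```

Choosing M and R, for example, gives a different pattern for each of A, C, G and T, whereas choosing both pooled nucleotides from the same pair (M and K) leaves A indistinguishable from C and G from T.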

In a seventh embodiment, the invention provides arrays for bridge tiling. Bridge tiling is a species of the general tiling strategies noted above, in which probes from the first probe set contain more than one segment of complementarity.

In bridge tiling, a nucleotide in a reference sequence is usually determined from a comparison of four probes. A first probe comprises at least first and second segments, each of at least three nucleotides and each exactly complementary to first and second subsequences of a reference sequence. The segments include at least one interrogation position corresponding to a nucleotide in the reference sequence.

Either (1) the first and second subsequences are noncontiguous in the reference sequence, or (2) the first and second subsequences are contiguous and the first and second segments are inverted relative to the first and second subsequences.

The array further comprises second, third and fourth probes, which are identical to a sequence comprising the first probe or a subsequence thereof comprising at least three nucleotides from each of the first and second segments, except in the at least one interrogation position, which differs in each of the probes. In a species of bridge tiling, referred to as deletion tiling, the first and second subsequences are separated by one or two nucleotides in the reference sequence.

In an eighth embodiment, the invention provides arrays of probes for multiplex tiling. Multiplex tiling is a strategy in which the identity of two nucleotides in a target sequence is determined from a comparison of the hybridization intensities of four probes, each having two interrogation positions. Each of the probes comprises a segment of at least 7 nucleotides that is exactly complementary to a subsequence from a reference sequence, except that the segment may or may not be exactly complementary at two interrogation positions. The nucleotides occupying the interrogation positions are selected by the following rules: (1) the first interrogation position is occupied by a different nucleotide in each of the four probes, (2) the second interrogation position is occupied by a different nucleotide in each of the four probes, (3) in first and second probes, the segment is exactly complementary to the subsequence, except at no more than one of the interrogation positions, (4) in third and fourth probes, the segment is exactly complementary to the subsequence, except at both of the interrogation positions.
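The four selection rules can be checked mechanically. The following Python sketch is one way to do so, assuming each probe is summarized by the pair of nucleotides at its two interrogation positions and that the pair giving exact complementarity is known. The function name check_multiplex and the example nucleotides are hypothetical.

```python
# Illustrative sketch only; probes are reduced to their two interrogation bases.
def check_multiplex(perfect, probes):
    """Return True if four probes satisfy rules (1)-(4) for multiplex tiling.

    perfect: (first, second) nucleotides giving exact complementarity.
    probes:  four (first, second) tuples, listed as first to fourth probe.
    """
    firsts = [p[0] for p in probes]
    seconds = [p[1] for p in probes]
    rule1 = len(set(firsts)) == 4    # all four differ at position one
    rule2 = len(set(seconds)) == 4   # all four differ at position two
    mismatches = [
        (p[0] != perfect[0]) + (p[1] != perfect[1]) for p in probes
    ]
    rule3 = mismatches[0] <= 1 and mismatches[1] <= 1
    rule4 = mismatches[2] == 2 and mismatches[3] == 2
    return rule1 and rule2 and rule3 and rule4

valid = check_multiplex(
    ("T", "G"),
    [("T", "A"), ("C", "G"), ("A", "C"), ("G", "T")],
)
invalid = check_multiplex(
    ("T", "G"),
    [("T", "G"), ("C", "A"), ("A", "C"), ("G", "T")],
)
```

In the valid example the first probe matches at the first position only and the second probe at the second position only. The invalid example shows that a fully matched first probe forces the second probe to mismatch at both positions, which violates rule (3).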

In a ninth embodiment, the invention provides arrays of immobilized probes including helper mutations. Helper mutations are useful for, e.g., preventing self-annealing of probes having inverted repeats. In this strategy, the identity of a nucleotide in a target sequence is usually determined from a comparison of four probes. A first probe comprises a segment of at least 7 nucleotides exactly complementary to a subsequence of a reference sequence except at one or two positions, the segment including an interrogation position not at the one or two positions. The one or two positions are occupied by helper mutations.

Second, third and fourth mutant probes are each identical to a sequence comprising the wildtype probe or a subsequence thereof including the interrogation position and the one or two positions, except in the interrogation position, which is occupied by a different nucleotide in each of the four probes.

In a tenth embodiment, the invention provides arrays of probes comprising at least two probe sets, but lacking a probe set comprising probes that are perfectly matched to a reference sequence. Such arrays are usually employed in methods in which both reference and target sequence are hybridized to the array. The first probe set comprises a plurality of probes, each probe comprising a segment exactly complementary to a subsequence of at least 3 nucleotides of a reference sequence except at an interrogation position. The second probe set comprises a corresponding probe for each probe in the first probe set, the corresponding probe in the second probe set being identical to a sequence comprising the corresponding probe from the first probe set or a subsequence of at least three nucleotides thereof that includes the interrogation position, except that the interrogation position is occupied by a different nucleotide in each of the two corresponding probes and the complement to the reference sequence.

In an eleventh embodiment, the invention provides methods of comparing a target sequence with a reference sequence comprising a predetermined sequence of nucleotides using any of the arrays described above. The methods comprise hybridizing the target nucleic acid to an array and determining which probes, relative to one another, in the array bind specifically to the target nucleic acid. The relative specific binding of the probes indicates whether the target sequence is the same or different from the reference sequence. In some such methods, the target sequence has a substituted nucleotide relative to the reference sequence in at least one undetermined position, and the relative specific binding of the probes indicates the location of the position and the nucleotide occupying the position in the target sequence. In some methods, a second target nucleic acid is also hybridized to the array. The relative specific binding of the probes then indicates both whether the target sequence is the same or different from the reference sequence, and whether the second target sequence is the same or different from the reference sequence. In some methods, when the array comprises two groups of probes tiled for first and second reference sequences, respectively, the relative specific binding of probes in the first group indicates whether the target sequence is the same or different from the first reference sequence. The relative specific binding of probes in the second group indicates whether the target sequence is the same or different from the second reference sequence. Such methods are particularly useful for analyzing heterologous alleles of a gene. Some methods entail hybridizing both a reference sequence and a target sequence to any of the arrays of probes described above. Comparison of the relative specific binding of the probes to the reference and target sequences indicates whether the target sequence is the same or different from the reference sequence.
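A minimal Python sketch of such a comparison follows. It assumes four probes per position, each summarized by the nucleotide at its interrogation position and a hybridization signal, and it calls the target base as the complement of the brightest probe. The function names, the min_ratio threshold and all intensity values are invented for illustration and do not come from the specification.

```python
# Illustrative sketch only; the threshold and the signal values are made up.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def call_base(intensities, min_ratio=1.5):
    """Call the target base from four probe signals at one position.

    intensities maps the probe's interrogation nucleotide to its signal.
    Returns "N" when the best probe is not clearly brighter than the next.
    """
    ranked = sorted(intensities, key=intensities.get, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    if (intensities[runner_up] > 0
            and intensities[best] / intensities[runner_up] < min_ratio):
        return "N"
    return COMPLEMENT[best]

def compare_to_reference(reference, signals):
    """List (index, reference_base, called_base) where the call differs.

    Ambiguous calls ("N") are reported as differences as well.
    """
    differences = []
    for index, (ref_base, intensities) in enumerate(zip(reference, signals)):
        called = call_base(intensities)
        if called != ref_base:
            differences.append((index, ref_base, called))
    return differences

signals = [
    {"A": 10, "C": 12, "G": 9, "T": 95},   # brightest T probe -> target A
    {"A": 8, "C": 11, "G": 90, "T": 7},    # brightest G probe -> target C
    {"A": 14, "C": 10, "G": 12, "T": 88},  # brightest T probe -> target A
]
differences = compare_to_reference("ACG", signals)
```

For these made-up signals the third position is called as A where the reference has G, so the comparison reports both the location and the substituted nucleotide.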

In a twelfth embodiment, the invention provides arrays of immobilized probes in which the probes are designed to tile a reference sequence from a human immunodeficiency virus.

Reference sequences from either the reverse transcriptase gene or protease gene of HIV are of particular interest. Some chips further comprise arrays of probes tiling a reference sequence from a 16S RNA or DNA encoding the 16S RNA from a pathogenic microorganism. The invention further provides methods of using such arrays in analyzing an HIV target sequence. The methods are particularly useful where the target sequence has a substituted nucleotide relative to the reference sequence in at least one position, the substitution conferring resistance to a drug used in treating a patient infected with an HIV virus. The methods reveal the existence of the substituted nucleotide. The methods are also particularly useful for analyzing a mixture of undetermined proportions of first and second target sequences from different HIV variants. The relative specific binding of probes indicates the proportions of the first and second target sequences.

In a thirteenth embodiment, the invention provides arrays of probes tiled based on reference sequence from a CFTR gene. A preferred array comprises at least a group of probes comprising a wildtype probe, and five sets of three mutant probes. The wildtype probe is exactly complementary to a subsequence of a reference sequence from a cystic fibrosis gene, the segment having at least five interrogation positions corresponding to five contiguous nucleotides in the reference sequence. The probes in the first set of three mutant probes are each identical to the wildtype probe, except in a first of the five interrogation positions, which is occupied by a different nucleotide in each of the three mutant probes and the wildtype probe. The probes in the second set of three mutant probes are each identical to the wildtype probe, except in a second of the five interrogation positions, which is occupied by a different nucleotide in each of the three mutant probes and the wildtype probe. The probes in the third set of three mutant probes are each identical to the wildtype probe, except in a third of the five interrogation positions, which is occupied by a different nucleotide in each of the three mutant probes and the wildtype probe. The probes in the fourth set of three mutant probes are each identical to the wildtype probe, except in a fourth of the five interrogation positions, which is occupied by a different nucleotide in each of the three mutant probes and the wildtype probe. The probes in the fifth set of three mutant probes are each identical to the wildtype probe, except in a fifth of the five interrogation positions, which is occupied by a different nucleotide in each of the three mutant probes and the wildtype probe. Preferably, a chip comprises two such groups of probes. The first group comprises a wildtype probe exactly complementary to a first reference sequence, and the second group comprises a wildtype probe exactly complementary to a second reference sequence that is a mutated form of the first reference sequence.

The invention further provides methods of using the arrays of the invention for analyzing target sequences from a CFTR gene. The methods are capable of simultaneously analyzing first and second target sequences representing heterozygous alleles of a CFTR gene.

In a fourteenth embodiment, the invention provides arrays of probes tiling a reference sequence from a p53 gene, an hMLH1 gene and/or an MSH2 gene. The invention further provides methods of using the arrays described above to analyze these genes. The methods are useful, e.g., for diagnosing patients susceptible to developing cancer.

In a fifteenth embodiment, the invention provides arrays of probes tiling a reference sequence from a mitochondrial genome. The reference sequence may comprise part or all of the D-loop region, or all, or substantially all, of the mitochondrial genome. The invention further provides methods of using the arrays described above to analyze target sequences from a mitochondrial genome. The methods are useful for identifying mutations associated with disease, and for forensic, epidemiological and evolutionary studies.

1.2.1.1.3. Specific Strategies for Utilizing Nucleic Acid Arrays

The invention provides a number of strategies for comparing a polynucleotide of known sequence (a reference sequence) with variants of that sequence (target sequences).

The comparison can be performed at the level of entire genomes, chromosomes, genes, exons or introns, or can focus on individual mutant sites and immediately adjacent bases. The strategies allow detection of variations, such as mutations or polymorphisms, in the target sequence irrespective of whether a particular variant has previously been characterized. The strategies both define the nature of a variant and identify its location in a target sequence.

The strategies employ arrays of oligonucleotide probes immobilized to a solid support. Target sequences are analyzed by determining the extent of hybridization at particular probes in the array. The strategy in selection of probes facilitates distinction between perfectly matched probes and probes showing single-base or other degrees of mismatches.

The strategy usually entails sampling each nucleotide of interest in a target sequence several times, thereby achieving a high degree of confidence in its identity. This level of confidence is further increased by sampling nucleotides in the target sequence that are adjacent to the nucleotides of interest.

The number of probes on the chip can be quite large (e.g., 10⁵-10⁶). However, usually only a small proportion of the total number of probes of a given length are represented.
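The scale of this proportion follows from simple arithmetic: there are 4 to the power n possible probes of length n. The short Python sketch below works this out for an assumed probe length of 15 nucleotides; the function names and the chosen length are illustrative assumptions.

```python
# Illustrative arithmetic only; the probe length of 15 is an assumed example.
def possible_probes(length):
    """Number of distinct oligonucleotide sequences of a given length."""
    return 4 ** length

def fraction_represented(probes_on_chip, length):
    """Fraction of all possible probes of that length present on the chip."""
    return probes_on_chip / possible_probes(length)

all_15mers = possible_probes(15)                 # 1,073,741,824 sequences
fraction = fraction_represented(10 ** 6, 15)     # just under 0.1 percent
```

Even a chip carrying 10⁶ probes therefore holds fewer than one in a thousand of all possible 15-mers, and the proportion falls rapidly for longer probes.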

Some advantages of the use of only a small proportion of all possible probes of a given length include: (i) each position in the array is highly informative, whether or not hybridization occurs; (ii) nonspecific hybridization is minimized; (iii) it is straightforward to correlate hybridization differences with sequence differences, particularly with reference to the hybridization pattern of a known standard; and (iv) the ability to address each probe independently during synthesis, using high resolution photolithography, allows the array to be designed and optimized for any sequence. For example, the length of any probe can be varied independently of the others.

The present tiling strategies result in sequencing and comparison methods suitable for routine large-scale practice with a high degree of confidence in the sequence output.

1.2.1.1.4. General Tiling Strategies

1.2.1.1.4.1. Selection of Reference Sequence

The chips are designed to contain probes exhibiting complementarity to one or more selected reference sequences whose sequence is known. The chips are used to read a target sequence comprising either the reference sequence itself or variants of that sequence. Target sequences may differ from the reference sequence at one or more positions but show a high overall degree of sequence identity with the reference sequence (e.g., at least 75, 90, 95, 99, 99.9 or 99.99%). Any polynucleotide of known sequence can be selected as a reference sequence. Reference sequences of interest include sequences known to include mutations or polymorphisms associated with phenotypic changes having clinical significance in human patients. For example, the CFTR gene and P53 gene in humans have been identified as the location of several mutations resulting in cystic fibrosis or cancer respectively. Other reference sequences of interest include those that serve to identify pathogenic microorganisms and/or are the site of mutations by which such microorganisms acquire drug resistance (e.g., the HIV reverse transcriptase gene). Other reference sequences of interest include regions where polymorphic variations are known to occur (e.g., the D-loop region of mitochondrial DNA). These reference sequences have utility for, e.g., forensic or epidemiological studies. Other reference sequences of interest include p34 (related to p53), p65 (implicated in breast, prostate and liver cancer), and DNA segments encoding cytochromes P450 (see Meyer et al., Pharmac. Ther. 46, 349-355 (1990)). Other reference sequences of interest include those from the genome of pathogenic viruses (e.g., hepatitis A, B, or C, herpes virus (e.g., VZV, HSV-1, HAV-6, HSV-II, and CMV, Epstein Barr virus), adenovirus, influenza virus, flaviviruses, echovirus, rhinovirus, coxsackie virus, coronavirus, respiratory syncytial virus, mumps virus, rotavirus, measles virus, rubella virus, parvovirus, vaccinia virus, HTLV virus, dengue virus, papillomavirus, molluscum virus, poliovirus, rabies virus, JC virus and arboviral encephalitis virus). Other reference sequences of interest are from genomes or episomes of pathogenic bacteria, particularly regions that confer drug resistance or allow phylogenic characterization of the host (e.g., 16S rRNA or corresponding DNA). For example, such bacteria include chlamydia, rickettsial bacteria, mycobacteria, staphylococci, streptococci, pneumococci, meningococci and gonococci, klebsiella, proteus, serratia, pseudomonas, legionella, diphtheria, salmonella, bacilli, cholera, tetanus, botulism, anthrax, plague, leptospirosis, and Lyme disease bacteria. Other reference sequences of interest include those in which mutations result in the following autosomal recessive disorders: sickle cell anemia, beta-thalassemia, phenylketonuria, galactosemia, Wilson's disease, hemochromatosis, severe combined immunodeficiency, alpha-1-antitrypsin deficiency, albinism, alkaptonuria, lysosomal storage diseases and Ehlers-Danlos syndrome. Other reference sequences of interest include those in which mutations result in X-linked recessive disorders: hemophilia, glucose-6-phosphate dehydrogenase deficiency, agammaglobulinemia, diabetes insipidus, Lesch-Nyhan syndrome, muscular dystrophy, Wiskott-Aldrich syndrome, Fabry's disease and fragile X-syndrome. Other reference sequences of interest include those in which mutations result in the following autosomal dominant disorders: familial hypercholesterolemia, polycystic kidney disease, Huntington's disease, hereditary spherocytosis, Marfan's syndrome, von Willebrand's disease, neurofibromatosis, tuberous sclerosis, hereditary hemorrhagic telangiectasia, familial colonic polyposis, Ehlers-Danlos syndrome, myotonic dystrophy, muscular dystrophy, osteogenesis imperfecta, acute intermittent porphyria, and von Hippel-Lindau disease.

The length of a reference sequence can vary widely from a full-length genome, to an individual chromosome, episome, gene, component of a gene, such as an exon, intron or regulatory sequences, to a few nucleotides. A reference sequence of about 2, 5, 10, 20, 50, 100, 500, 1000, 5,000, 10,000, 20,000 or 100,000 nucleotides is common.

Sometimes only particular regions of a sequence (e.g., exons of a gene) are of interest. In such situations, the particular regions can be considered as separate reference sequences or can be considered as components of a single reference sequence, as a matter of arbitrary choice.

A reference sequence can be any naturally occurring, mutant, consensus or purely hypothetical sequence of nucleotides, RNA or DNA. For example, sequences can be obtained from computer data bases or publications, or can be determined or conceived de novo. Usually, a reference sequence is selected to show a high degree of sequence identity to envisaged target sequences. Often, particularly where a significant degree of divergence is anticipated between target sequences, more than one reference sequence is selected. Combinations of wildtype and mutant reference sequences are employed in several applications of the tiling strategy.

1.2.1.1.5. Chip Design

1.2.1.1.5.1. Basic Tiling Strategy

The basic tiling strategy provides an array of immobilized probes for analysis of target sequences showing a high degree of sequence identity to one or more selected reference sequences. The strategy is first illustrated for an array that is subdivided into four probe sets, although it will be apparent that in some situations, satisfactory results are obtained from only two probe sets. A first probe set comprises a plurality of probes exhibiting perfect complementarity with a selected reference sequence. The perfect complementarity usually exists throughout the length of the probe. However, probes having a segment or segments of perfect complementarity that is/are flanked by leading or trailing sequences lacking complementarity to the reference sequence can also be used. Within a segment of complementarity, each probe in the first probe set has at least one interrogation position that corresponds to a nucleotide in the reference sequence. That is, the interrogation position is aligned with the corresponding nucleotide in the reference sequence, when the probe and reference sequence are aligned to maximize complementarity between the two. If a probe has more than one interrogation position, each corresponds with a respective nucleotide in the reference sequence. The identity of an interrogation position and corresponding nucleotide in a particular probe in the first probe set cannot be determined simply by inspection of the probe in the first set. As will become apparent, an interrogation position and corresponding nucleotide is defined by the comparative structures of probes in the first probe set and corresponding probes from additional probe sets.

In principle, a probe could have an interrogation position at each position in the segment complementary to the reference sequence. Sometimes, interrogation positions provide more accurate data when located away from the ends of a segment of complementarity. Thus, typically a probe having a segment of complementarity of length x does not contain more than x-2 interrogation positions. Since probes are typically 9-21 nucleotides, and usually all of a probe is complementary, a probe typically has 1-19 interrogation positions. Often the probes contain a single interrogation position, at or near the center of the probe.

For each probe in the first set, there are, for purposes of the presentillustration, three corresponding probes from three additional probesets. Thus, there are four probes corresponding to each nucleotide ofinterest in the reference sequence. Each of the four correspondingprobes has an interrogation position aligned with that nucleotide ofinterest. Usually, the probes from the three additional probe sets areidentical to the corresponding probe from the first probe set with oneexception. The exception is that at least one (and often only one)interrogation position, which occurs in the same position in each of thefour corresponding probes from the four probe sets, is occupied by adifferent nucleotide in the four probe sets. For example, for an Anucleotide in the reference sequence, the corresponding probe from thefirst probe set has its interrogation position occupied by a T, and thecorresponding probes from the additional three probe sets have theirrespective interrogation positions occupied by A, C, or G, a differentnucleotide in each probe. Of course, if a probe from the first probe setcomprises trailing or flanking sequences lacking complementarity to thereference sequences, these sequences need not be present incorresponding probes from the three additional sets. Likewisecorresponding probes from the three additional sets can contain leadingor trailing sequences outside the segment of complementarity that arenot present in the corresponding probe from the first probe set.Occasionally, the probes from the additional three probe set areidentical (with the exception of interrogation position(s)) to acontiguous subsequence of the full complementary segment of thecorresponding probe from the first probe set. In this case, thesubsequence includes the interrogation position and usually differs fromthe full-length probe only in the omission of one or both terminalnucleotides from the termini of a segment of complementarity.
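The relationship between the four corresponding probes can be sketched in Python as follows. The sketch assumes a single, central interrogation position, probes consisting entirely of the segment of complementarity, and probes written so that they align position-by-position with the reference. The function name four_probes and the example reference are hypothetical.

```python
# Illustrative sketch only; one central interrogation position is assumed.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def four_probes(reference, index, flank=4):
    """Return (first_set_probe, three_other_probes) for one nucleotide.

    The first-set probe is the complement of the window centred on
    reference[index]; the other three differ only at the centre.
    """
    start, stop = index - flank, index + flank + 1
    if start < 0 or stop > len(reference):
        raise ValueError("window runs off the reference")
    first_set = "".join(COMPLEMENT[b] for b in reference[start:stop])
    others = [
        first_set[:flank] + base + first_set[flank + 1:]
        for base in "ACGT"
        if base != first_set[flank]
    ]
    return first_set, others

# The nucleotide of interest is the A at index 4 of this made-up reference.
first_set, others = four_probes("CCGTAGCTA", 4)
```

For the A at the centre of this example, the first-set probe carries a T at its interrogation position and the three additional probes carry A, C and G respectively, as in the illustration above.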

That is, if a probe from the first probe set has a segment of complementarity of length n, corresponding probes from the other sets will usually include a subsequence of the segment of at least length n-2. Thus, the subsequence is usually at least 3, 4, 7, 9, 15, 21, or 25 nucleotides long, most typically in the range of 9-21 nucleotides. The subsequence should be sufficiently long to allow a probe to hybridize detectably more strongly to a variant of the reference sequence mutated at the interrogation position than to the reference sequence.

The probes can be oligodeoxyribonucleotides or oligoribonucleotides, or any modified forms of these polymers that are capable of hybridizing with a target nucleic acid sequence by complementary base pairing. Complementary base pairing means sequence-specific base pairing, which includes, e.g., Watson-Crick base pairing as well as other forms of base pairing such as Hoogsteen base pairing. Modified forms include 2'-O-methyl oligoribonucleotides and so-called PNAs, in which oligodeoxyribonucleotides are linked via peptide bonds rather than phosphodiester bonds. The probes can be attached by any linkage to a support (e.g., 3', 5' or via the base). 3' attachment is more usual as this orientation is compatible with the preferred chemistry for solid phase synthesis of oligonucleotides.

The number of probes in the first probe set (and as a consequence the number of probes in additional probe sets) depends on the length of the reference sequence, the number of nucleotides of interest in the reference sequence and the number of interrogation positions per probe. In general, each nucleotide of interest in the reference sequence requires the same interrogation position in the four sets of probes.

Consider, as an example, a reference sequence of 100 nucleotides, 50 of which are of interest, and probes each having a single interrogation position. In this situation, the first probe set requires fifty probes, each having one interrogation position corresponding to a nucleotide of interest in the reference sequence. The second, third and fourth probe sets each have a corresponding probe for each probe in the first probe set, and so each also contains a total of fifty probes. The identity of each nucleotide of interest in the reference sequence is determined by comparing the relative hybridization signals at four probes having interrogation positions corresponding to that nucleotide from the four probe sets.
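The arithmetic of this example can be written out as a short sketch. The function name is hypothetical, and the calculation assumes, as in the example, a single interrogation position per probe.

```python
def probe_counts(nucleotides_of_interest, n_sets=4):
    """Probes per set and in total, assuming one interrogation position per probe."""
    per_set = nucleotides_of_interest   # one first-set probe per nucleotide of interest
    return per_set, per_set * n_sets

per_set, total = probe_counts(50)       # 50 probes per set, 200 probes in all
```

The length of the reference sequence (100 nucleotides) does not enter the count; only the number of nucleotides of interest does.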

In some reference sequences, every nucleotide is of interest. In other reference sequences, only certain portions in which variants (e.g., mutations or polymorphisms) are concentrated are of interest. In other reference sequences, only particular mutations or polymorphisms and immediately adjacent nucleotides are of interest. Usually, the first probe set has interrogation positions selected to correspond to at least a nucleotide (e.g., representing a point mutation) and one immediately adjacent nucleotide. Usually, the probes in the first set have interrogation positions corresponding to at least 3, 10, 50, 100, 1000, or 20,000 contiguous nucleotides. The probes usually have interrogation positions corresponding to at least 5, 10, 30, 50, 75, 90, 99 or sometimes 100% of the nucleotides in a reference sequence.

Frequently, the probes in the first probe set completely span the reference sequence and overlap with one another relative to the reference sequence. For example, in one common arrangement each probe in the first probe set differs from another probe in that set by the omission of a 3' base complementary to the reference sequence and the acquisition of a 5' base complementary to the reference sequence.

For conceptual simplicity, the probes in a set are usually arranged in order of the sequence in a lane across the chip. A lane contains a series of overlapping probes, which represent, or tile across, the selected reference sequence. The components of the four sets of probes are usually laid down in four parallel lanes, collectively constituting a row in the horizontal direction and a series of 4-member columns in the vertical direction. Corresponding probes from the four probe sets (i.e., complementary to the same subsequence of the reference sequence) occupy a column.

Each probe in a lane usually differs from its predecessor in the lane by the omission of a base at one end and the inclusion of an additional base at the other end. However, this orderly progression of probes can be interrupted by the inclusion of control probes or omission of probes in certain columns of the array. Such columns serve as controls to orient the chip, or gauge the background, which can include target sequence nonspecifically bound to the chip.

The probe sets are usually laid down in lanes such that all probes having an interrogation position occupied by an A form an A-lane, all probes having an interrogation position occupied by a C form a C-lane, all probes having an interrogation position occupied by a G form a G-lane, and all probes having an interrogation position occupied by a T (or U) form a T-lane (or a U-lane). Note that in this arrangement there is not a unique correspondence between probe sets and lanes. Thus, the probe from the first probe set is laid down in the A-lane, C-lane, A-lane, A-lane and T-lane for the five columns. The interrogation position on a column of probes corresponds to the position in the target sequence whose identity is determined from analysis of hybridization to the probes in that column. The interrogation position can be anywhere in a probe but is usually at or near the central position of the probe to maximize differential hybridization signals between a perfect match and a single-base mismatch.

For example, for an 11-mer probe, the central position is the sixth nucleotide.
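The lane arrangement described above can be sketched as follows. The short reference sequence is hypothetical and was chosen so that the first-set (perfect-match) probe falls in the A, C, A, A and T lanes across five columns; 3-mer probes are used only to keep the example small, and probes are written in alignment with the reference as a simplifying assumption.

```python
# Illustrative sketch only: sequence and names are hypothetical.
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def tile_lanes(reference, probe_len):
    """Tile overlapping probes across the reference and sort them into lanes
    by the nucleotide occupying the central interrogation position."""
    center = probe_len // 2                      # 0-based index of the central base
    lanes = {base: [] for base in "ACGT"}
    perfect_match_lane = []                      # lane holding the first-set probe, per column
    for start in range(len(reference) - probe_len + 1):
        wild = "".join(COMPLEMENT[b] for b in reference[start:start + probe_len])
        for base in "ACGT":
            lanes[base].append(wild[:center] + base + wild[center + 1:])
        perfect_match_lane.append(wild[center])
    return lanes, perfect_match_lane

lanes, perfect_match_lane = tile_lanes("CTGTTAG", probe_len=3)
# perfect_match_lane is ['A', 'C', 'A', 'A', 'T']

central_position_11mer = 11 // 2 + 1             # 1-based: the sixth nucleotide
```

Each column consists of the four entries at the same index in the four lanes, and the lane holding the perfect match changes from column to column.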

Although the array of probes is usually laid down in rows and columns as described above, such a physical arrangement of probes on the chip is not essential. Provided that the spatial location of each probe in an array is known, the data from the probes can be collected and processed to yield the sequence of a target irrespective of the physical arrangement of the probes on a chip. In processing the data, the hybridization signals from the respective probes can be reassorted into any conceptual array desired for subsequent data reduction, whatever the physical arrangement of probes on the chip.

A range of lengths of probes can be employed in the chips. As noted above, a probe may consist exclusively of a complementary segment, or may have one or more complementary segments juxtaposed by flanking, trailing and/or intervening segments. In the latter situation, the total length of complementary segment(s) is more important than the length of the probe. In functional terms, the complementarity segment(s) of the first probe sets should be sufficiently long to allow the probe to hybridize detectably more strongly to a reference sequence compared with a variant of the reference including a single base mutation at the nucleotide corresponding to the interrogation position of the probe.

Similarly, the complementarity segment(s) in corresponding probes fromadditional probe sets should be sufficiently long to allow a probe tohybridize detectably more strongly to a variant of the referencesequence having a single nucleotide substitution at the interrogationposition relative to the reference sequence. A probe usually has asingle complementary segment having a length of at least 3 nucleotides,and more usually at least 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17,18, 19, 20, 21, 22, 23, 24, 25 or bases exhibiting perfectcomplementarity (other than possibly at the interrogation position(s)depending on the probe set) to the reference sequence. In bridgingstrategies, where more than one segment of complementarity is present,each segment provides at least three complementary nucleotides to thereference sequence and the combined segments provide at least twosegments of three or a total of six complementary nucleotides. As in theother strategies, the combined length of complementary segments istypically from 6-30 nucleotides, and preferably from about 9-21nucleotides. The two segments are often approximately the same length.Often, the probes (or segment of complementarity within probes) have anodd number of bases, so that an interrogation position can occur in theexact center of the probe.

In some chips, all probes are the same length. Other chips employ different groups of probe sets, in which case the probes are of the same size within a group, but differ between different groups. For example, some chips have one group comprising four sets of probes as described above in which all the probes are 11-mers, together with a second group comprising four sets of probes in which all of the probes are 13-mers. Of course, additional groups of probes can be added.

Thus, some chips contain, e.g., four groups of probes having sizes of 11mers, 13 mers, 15 mers and 17 mers. Other chips have different sizeprobes within the same group of four probe sets. In these chips, theprobes in the first set can vary in length independently of each other.Probes in the other sets are usually the same length as the probeoccupying the same column from the first set. However, occasionallydifferent lengths of probes can be included at the same column positionin the four lanes. The different length probes are included to equalizehybridization signals from probes irrespective of whether A-T or C-Gbonds are formed at the interrogation position.

The length of a probe can be important in distinguishing between a perfectly matched probe and probes showing a single-base mismatch with the target sequence. The discrimination is usually greater for short probes. Shorter probes are usually also less susceptible to formation of secondary structures.

However, the absolute amount of target sequence bound, and hence thesignal, is greater for larger probes. The probe length representing theoptimum compromise between these competing considerations may varydepending on inter alia the GC content of a particular region of thetarget DNA sequence, secondary structure, synthesis efficiency andcross-hybridization. In some regions of the target, depending onhybridization conditions, short probes (e.g., 11 mers) may provideinformation that is inaccessible from longer probes (e.g., 19 mers) andvice versa. Maximum sequence information can be read by includingseveral groups of different sized probes on the chip as noted above.However, for many regions of the target sequence, such a strategyprovides redundant information in that the same sequence is readmultiple times from the different groups of probes. Equivalentinformation can be obtained from a single group of different sizedprobes in which the sizes are selected to maximize readable sequence atparticular regions of the target sequence. The strategy of customizingprobe length within a single group of probe sets minimizes the totalnumber of probes required to read a particular target sequence. Thisleaves ample capacity for the chip to include probes to other referencesequences.

The invention provides an optimization block which allows systematic variation of probe length and interrogation position to optimize the selection of probes for analyzing a particular nucleotide in a reference sequence. The block comprises alternating columns of probes complementary to the wildtype target and probes complementary to a specific mutation. The interrogation position is varied between columns and probe length is varied down a column.

Hybridization of the chip to the reference sequence or the mutant form of the reference sequence identifies the probe length and interrogation position providing the greatest differential hybridization signal.

The probes are designed to be complementary to either strand of the reference sequence (e.g., coding or non-coding). Some chips contain separate groups of probes, one complementary to the coding strand, the other complementary to the noncoding strand. Independent analysis of coding and noncoding strands provides largely redundant information.

However, the regions of ambiguity in reading the coding strand are not always the same as those in reading the noncoding strand. Thus, combination of the information from coding and noncoding strands increases the overall accuracy of sequencing.

Some chips contain additional probes or groups of probes designed to be complementary to a second reference sequence.

The second reference sequence is often a subsequence of the first reference sequence bearing one or more commonly occurring mutations or interstrain variations. The second group of probes is designed by the same principles as described above except that the probes exhibit complementarity to the second reference sequence. The inclusion of a second group is particularly useful for analyzing short subsequences of the primary reference sequence in which multiple mutations are expected to occur within a short distance commensurate with the length of the probes (i.e., two or more mutations within 9 to 21 bases). Of course, the same principle can be extended to provide chips containing groups of probes for any number of reference sequences. Alternatively, the chips may contain additional probe(s) that do not form part of a tiled array as noted above, but rather serve as probe(s) for a conventional reverse dot blot. For example, the presence of a mutation can be detected from binding of a target sequence to a single oligomeric probe harboring the mutation. Preferably, an additional probe containing the equivalent region of the wildtype sequence is included as a control.

The chips are read by comparing the intensities of labelled target bound to the probes in an array.

Specifically, a comparison is performed between each lane of probes (e.g., A, C, G and T lanes) at each columnar position (physical or conceptual). For a particular columnar position, the lane showing the greatest hybridization signal is called as the nucleotide present at the position in the target sequence corresponding to the interrogation position in the probes. The corresponding position in the target sequence is that aligned with the interrogation position in corresponding probes when the probes and target are aligned to maximize complementarity. Of the four probes in a column, only one can exhibit a perfect match to the target sequence whereas the others usually exhibit at least a one-base-pair mismatch. The probe exhibiting a perfect match usually produces a substantially greater hybridization signal than the other three probes in the column and is thereby easily identified. However, in some regions of the target sequence, the distinction between a perfect match and a one-base mismatch is less clear. Thus, a call ratio is established to define the ratio of signal from the best hybridizing probe to the second best hybridizing probe that must be exceeded for a particular target position to be read from the probes. A high call ratio ensures that few if any errors are made in calling target nucleotides, but can result in some nucleotides being scored as ambiguous, which could in fact be accurately read.

A lower call ratio results in fewer ambiguous calls, but can result in more erroneous calls. It has been found that at a call ratio of 1.2 virtually all calls are accurate. However, a small but significant number of bases (e.g., up to about %) may have to be scored as ambiguous.
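The call-ratio rule can be sketched as a small function. The signal values, the function name, the use of "N" for an ambiguous call, and the assumption that each signal is keyed by the target nucleotide its lane reports are all illustrative choices, not details taken from the text.

```python
def call_base(signals, call_ratio=1.2):
    """Call the nucleotide for one column from its four lane signals.

    `signals` maps each lane's nucleotide to a hybridization intensity.
    The call is made only if the best signal exceeds the second-best signal
    by more than `call_ratio`; otherwise the position is scored ambiguous.
    """
    ranked = sorted(signals.items(), key=lambda item: item[1], reverse=True)
    (best_base, best), (_, second) = ranked[0], ranked[1]
    if best <= 0:
        return "N"                      # no usable signal in this column
    if second <= 0 or best / second > call_ratio:
        return best_base
    return "N"                          # ambiguous: ratio not exceeded

clear_call = call_base({"A": 100.0, "C": 20.0, "G": 15.0, "T": 10.0})
ambiguous_call = call_base({"A": 100.0, "C": 90.0, "G": 15.0, "T": 10.0})
```

In the second column the ratio is 100/90, roughly 1.11, which does not exceed 1.2, so the position is left uncalled; raising `call_ratio` trades more ambiguous positions for fewer erroneous calls.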

Although small regions of the target sequence can sometimes be ambiguous, these regions usually occur at the same or similar segments in different target sequences. Thus, for precharacterized mutations, it is known in advance whether that mutation is likely to occur within a region of unambiguously determinable sequence.

An array of probes is most useful for analyzing the reference sequence from which the probes were designed and variants of that sequence exhibiting substantial sequence similarity with the reference sequence (e.g., several single-base mutants spaced over the reference sequence). When an array is used to analyze the exact reference sequence from which it was designed, one probe exhibits a perfect match to the reference sequence, and the other three probes in the same column exhibit single-base mismatches. Thus, discrimination between hybridization signals is usually high and accurate sequence is obtained. High accuracy is also obtained when an array is used for analyzing a target sequence comprising a variant of the reference sequence that has a single mutation relative to the reference sequence, or several widely spaced mutations relative to the reference sequence. At different mutant loci, one probe exhibits a perfect match to the target, and the other three probes occupying the same column exhibit single-base mismatches, the difference (with respect to analysis of the reference sequence) being the lane in which the perfect match occurs.

For target sequences showing a high degree of divergence from thereference strain or incorporating several closely spaced mutations fromthe reference strain, a single group of probes (i.e., designed withrespect to a single reference sequence) will not always provide accuratesequence for the highly variant region of this sequence. At someparticular columnar positions, it may be that no single probe exhibitsperfect complementarity to the target and that any comparison must bebased on different degrees of mismatch between the four probes. Such acomparison does not always allow the target nucleotide corresponding tothat columnar position to be called. Deletions in target sequences canbe detected by loss of signal from probes having interrogation positionsencompassed by the deletion. However, signal may also be lost fromprobes having interrogation positions closely proximal to the deletionresulting in some regions of the target sequence that cannot be read.Target sequence bearing insertions will also exhibit short regionsincluding and proximal to the insertion that usually cannot be read.

The presence of short regions of difficult-to-read target because of closely spaced mutations, insertions or deletions does not prevent determination of the remaining sequence of the target, as different regions of a target sequence are determined independently. Moreover, such ambiguities as might result from analysis of diverse variants with a single group of probes can be avoided by including multiple groups of probe sets on a chip. For example, one group of probes can be designed based on a full-length reference sequence, and the other groups on subsequences of the reference sequence incorporating frequently occurring mutations or strain variations.

A particular advantage of the present sequencing strategy over conventional sequencing methods is the capacity simultaneously to detect and quantify proportions of multiple target sequences. Such capacity is valuable, e.g., for diagnosis of patients who are heterozygous with respect to a gene or who are infected with a virus, such as HIV, which is usually present in several polymorphic forms. Such capacity is also useful in analyzing targets from biopsies of tumor cells and surrounding tissues. The presence of multiple target sequences is detected from the relative signals of the four probes at the array columns corresponding to the target nucleotides at which diversity occurs. The relative signals at the four probes for the mixture under test are compared with the corresponding signals from a homogeneous reference sequence. An increase in a signal from a probe that is mismatched with respect to the reference sequence, and a corresponding decrease in the signal from the probe which is matched with the reference sequence, signal the presence of a mutant strain in the mixture. The extent of the shift in hybridization signals of the probes is related to the proportion of a target sequence in the mixture. Shifts in relative hybridization signals can be quantitatively related to proportions of reference and mutant sequence by prior calibration of the chip with seeded mixtures of the mutant and reference sequences. By this means, a chip can be used to detect variant or mutant strains constituting as little as 1, 5, 20, or 25% of a mixture of strains.
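One way the calibration with seeded mixtures might be used is sketched below. The text does not specify how signal shifts are converted to proportions, so the linear interpolation, the calibration points, the signal values and the function names are all assumptions made for illustration.

```python
def mutant_signal_fraction(ref_signal, mut_signal):
    """Fraction of the combined signal contributed by the mutant-matched probe."""
    return mut_signal / (ref_signal + mut_signal)

def estimate_proportion(fraction, calibration):
    """Interpolate the mutant proportion from calibration points.

    `calibration` is a list of (known mutant proportion, observed signal
    fraction) pairs from seeded mixtures; signal fractions are assumed to
    increase strictly with the proportion.
    """
    points = sorted(calibration, key=lambda p: p[1])
    if fraction <= points[0][1]:
        return points[0][0]
    if fraction >= points[-1][1]:
        return points[-1][0]
    for (p0, f0), (p1, f1) in zip(points, points[1:]):
        if f0 <= fraction <= f1:
            return p0 + (p1 - p0) * (fraction - f0) / (f1 - f0)

# Hypothetical calibration data from seeded mixtures.
calibration = [(0.0, 0.05), (0.25, 0.20), (0.50, 0.40), (1.0, 0.90)]

fraction = mutant_signal_fraction(ref_signal=70.0, mut_signal=30.0)
estimated = estimate_proportion(fraction, calibration)   # about 0.375
```

The nonzero signal fraction at a proportion of zero in the hypothetical calibration stands for background cross-hybridization of the reference sequence to the mismatched probe.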

Similar principles allow the simultaneous analysis of multiple targetsequences even when none is identical to the reference sequence. Forexample, with a mixture of two target sequences bearing first and secondmutations, there would be a variation in the hybridization patterns ofprobes having interrogation positions corresponding to the first andsecond mutations relative to the hybridization pattern with thereference sequence. At each position, one of the probes having amismatched interrogation position relative to the reference sequencewould show an increase in hybridization signal, and the probe having amatched interrogation position relative to the reference sequence wouldshow a decrease in hybridization signal. Analysis of the hybridizationpattern of the mixture of mutant target sequences, preferably incomparison with the hybridization pattern of the reference sequence,indicates the presence of two mutant target sequences, the position andnature of the mutation in each strain, and the relative proportions ofeach strain.

In a variation of the above method, the different components in a mixture of target sequences are differentially labelled before being applied to the array. For example, a variety of fluorescent labels emitting at different wavelengths are available. The use of differential labels allows independent analysis of different targets bound simultaneously to the array. For example, the methods permit comparison of target sequences obtained from a patient at different stages of a disease.

1.2.1.1.5.2. Omission of Probes

The general strategy outlined above employs four probes to read each nucleotide of interest in a target sequence. One probe (from the first probe set) shows a perfect match to the reference sequence and the other three probes (from the second, third and fourth probe sets) exhibit a mismatch with the reference sequence and a perfect match with a target sequence bearing a mutation at the nucleotide of interest.

The provision of three probes from the second, third and fourth probe sets allows detection of each of the three possible nucleotide substitutions of any nucleotide of interest. However, in some reference sequences or regions of reference sequences, it is known in advance that only certain mutations are likely to occur. Thus, for example, at one site it might be known that an A nucleotide in the reference sequence may exist as a T mutant in some target sequences but is unlikely to exist as a C or G mutant. Accordingly, for analysis of this region of the reference sequence, one might include only the first and second probe sets, the first probe set exhibiting perfect complementarity to the reference sequence, and the second probe set having an interrogation position occupied by an invariant A residue (for detecting the T mutant). In other situations, one might include the first, second and third probe sets (but not the fourth) for detection of a wildtype nucleotide in the reference sequence and two mutant variants thereof in target sequences. In some chips, probes that would detect silent mutations (i.e., not affecting amino acid sequence) are omitted.

In some chips, the probes from the first probe set are omitted corresponding to some or all positions of the reference sequences. Such chips comprise at least two probe sets. The first probe set has a plurality of probes. Each probe comprises a segment exactly complementary to a subsequence of a reference sequence except in at least one interrogation position. A second probe set has a corresponding probe for each probe in the first probe set.

The corresponding probe in the second probe set is identical to a sequence comprising the corresponding probe from the first probe set or a subsequence thereof that includes the at least one (and usually only one) interrogation position, except that the at least one interrogation position is occupied by a different nucleotide in each of the two corresponding probes from the first and second probe sets. A third probe set, if present, also comprises a corresponding probe for each probe in the first probe set except at the at least one interrogation position, which differs in the corresponding probes from the three sets. Omission of probes having a segment exhibiting perfect complementarity to the reference sequence results in loss of control information, i.e., the detection of nucleotides in a target sequence that are the same as those in a reference sequence. However, similar information can be obtained by hybridizing a chip lacking probes from the first probe set to both target and reference sequences. The hybridization can be performed sequentially, or concurrently, if the target and reference are differentially labelled. In this situation, the presence of a mutation is detected by a shift in the background hybridization intensity of the reference sequence to a perfectly matched hybridization signal of the target sequence, rather than by a comparison of the hybridization intensities of probes from the first set with corresponding probes from the second, third and fourth sets.

1.2.1.1.5.3. Wildtype Probe Lane

When the chips comprise four probe sets, as discussed supra, and the probe sets are laid down in four lanes, an A-lane, a C-lane, a G-lane and a T or U-lane, the probe having a segment exhibiting perfect complementarity to a reference sequence varies between the four lanes from one column to another. This does not present any significant difficulty in computer analysis of the data from the chip. However, visual inspection of the hybridization pattern of the chip is sometimes facilitated by provision of an extra lane of probes, in which each probe has a segment exhibiting perfect complementarity to the reference sequence. This segment is identical to a segment from one of the probes in the other four lanes (which lane depending on the column position). The extra lane of probes (designated the wildtype lane) hybridizes to a target sequence at all nucleotide positions except those in which deviations from the reference sequence occur. The hybridization pattern of the wildtype lane thereby provides a simple visual indication of mutations.

1.2.1.1.5.4. Deletion, Insertion and Multiple-Mutation Probes

Some chips provide an additional probe set specifically designed for analyzing deletion mutations. The additional probe set comprises a probe corresponding to each probe in the first probe set as described above. However, a probe from the additional probe set differs from the corresponding probe in the first probe set in that the nucleotide occupying the interrogation position is deleted in the probe from the additional probe set. Optionally, the probe from the additional probe set bears an additional nucleotide at one of its termini relative to the corresponding probe from the first probe set. The probe from the additional probe set will hybridize more strongly than the corresponding probe from the first probe set to a target sequence having a single base deletion at the nucleotide corresponding to the interrogation position. Additional probe sets are provided in which not only the interrogation position, but also an adjacent nucleotide is deleted.

Similarly, other chips provide additional probe sets for analyzing insertions. For example, one additional probe set has a probe corresponding to each probe in the first probe set as described above. However, the probe in the additional probe set has an extra T nucleotide inserted adjacent to the interrogation position. Optionally, the probe has one fewer nucleotide at one of its termini relative to the corresponding probe from the first probe set. The probe from the additional probe set hybridizes more strongly than the corresponding probe from the first probe set to a target sequence having an A nucleotide inserted in a position adjacent to that corresponding to the interrogation position.

Similar additional probe sets are constructed having C, G or T/Unucleotides inserted adjacent to the interrogation position. Usually,four such probe sets, one for each nucleotide, are used in combination.
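The deletion and insertion probes described in this section can be derived from a first-set probe as in the following sketch. The probe sequence and function names are hypothetical, the optional adjustment of a terminal nucleotide is not modeled, and the inserted base is placed immediately after the interrogation position as one possible reading of "adjacent".

```python
# Illustrative sketch only: sequence and names are hypothetical.
def deletion_probe(first_set_probe, position):
    """Probe lacking the nucleotide at the interrogation position."""
    return first_set_probe[:position] + first_set_probe[position + 1:]

def insertion_probes(first_set_probe, position):
    """Four probes, each with one extra nucleotide inserted immediately
    after the interrogation position (assumed placement)."""
    return {
        base: first_set_probe[:position + 1] + base + first_set_probe[position + 1:]
        for base in "ACGT"
    }

first_set_probe = "CGATCTGCAAG"     # hypothetical 11-mer, interrogation index 5
del_probe = deletion_probe(first_set_probe, 5)
ins_probes = insertion_probes(first_set_probe, 5)
```

The probe `ins_probes["T"]` would be the one expected to hybridize best to a target carrying an inserted A next to the interrogated nucleotide.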

Other chips provide additional probes (multiple-mutation probes) foranalyzing target sequences having multiple closely spaced mutations. Amultiple-mutation probe is usually identical to a corresponding probefrom the first set as described above, except in the base occupying theinterrogation position, and except at one or more additional positions,corresponding to nucleotides in which substitution may occur in thereference sequence. The one or more additional positions in the multiplemutation probe are occupied by nucleotides complementary to thenucleotides occupying corresponding positions in the reference sequencewhen the possible substitutions have occurred.

1.2.1.1.5.5. Block Tiling

As noted in the discussion of the general tiling strategy, a probe inthe first probe set sometimes has more than one interrogation position.In this situation, a probe in the first probe set is sometimes matchedwith multiple groups of at least one, and usually, three additionalprobe sets. Three additional probe sets are used to allow detection ofthe three possible nucleotide substitutions at any one position. If onlycertain types of substitution are likely to occur (e.g., transitions),only one or two additional probe sets are required (analogous to the useof probes in the basic tiling strategy). To illustrate for the situationwhere a group comprises three additional probe sets, a first such groupcomprises second, third and fourth probe sets, each of which has a probecorresponding to each probe in the first probe set. The correspondingprobes from the second, third and fourth probes sets differ from thecorresponding probe in the first set at a first of the interrogationpositions. Thus, the relative hybridization signals from correspondingprobes from the first, second, third and fourth probe sets indicate theidentity of the nucleotide in a target sequence corresponding to thefirst interrogation position. A second group of three probe sets(designated fifth, sixth and seventh probe sets), each also have a probecorresponding to each probe in the first probe set. These correspondingprobes differ from that in the first probe set at a second interrogationposition. The relative hybridization signals from corresponding probesfrom the first, fifth, sixth, and seventh probe sets indicate theidentity of the nucleotide in the target sequence corresponding to thesecond interrogation position. As noted above, the probes in the firstprobe set often have seven or more interrogation positions. 
If there are seven interrogation positions, there are seven groups of three additional probe sets, each group of three probe sets serving to identify the nucleotide corresponding to one of the seven interrogation positions.

Each block of probes allows short regions of a target sequence to be read. For example, for a block of probes having seven interrogation positions, seven nucleotides in the target sequence can be read. Of course, a chip can contain any number of blocks depending on how many nucleotides of the target are of interest. The hybridization signals for each block can be analyzed independently of any other block. The block tiling strategy can also be combined with other tiling strategies, with different parts of the same reference sequence being tiled by different strategies.

The block tiling strategy offers three advantages over the basic strategy in which each probe in the first set has a single interrogation position. One advantage is that the same sequence information can be obtained from fewer probes. A second advantage is that each of the probes constituting a block (i.e., a probe from the first probe set and a corresponding probe from each of the other probe sets) can have identical 3' and 5' sequences, with the variation confined to a central segment containing the interrogation positions. The identity of 3' sequence between different probes simplifies the strategy for solid phase synthesis of the probes on the chip and results in more uniform deposition of the different probes on the chip, thereby in turn increasing the uniformity of signal to noise ratio for different regions of the chip. A third advantage is that greater signal uniformity is achieved within a block.
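The saving in probe number can be made concrete with a short calculation. It assumes the full complement of three additional probe sets per interrogation position in both strategies; the function names are hypothetical.

```python
def basic_tiling_probes(n_nucleotides):
    """Basic strategy: four probes (one per probe set) for each nucleotide read."""
    return 4 * n_nucleotides

def block_tiling_probes(n_interrogation_positions):
    """Block strategy: one shared first-set probe plus three additional
    probes for each interrogation position in the block."""
    return 1 + 3 * n_interrogation_positions

basic = basic_tiling_probes(7)      # 28 probes to read seven nucleotides
block = block_tiling_probes(7)      # 22 probes to read the same seven nucleotides
```

The difference arises because the first-set probe is shared by all interrogation positions in a block instead of being repeated for each nucleotide.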

1.2.1.1.5.6. Multiplex Tiling

In the block tiling strategy discussed above, the identity of a nucleotide in a target or reference sequence is determined by comparison of hybridization patterns of one probe having a segment showing a perfect match with that of other probes (usually three other probes) showing a single base mismatch. In multiplex tiling, the identity of at least two nucleotides in a reference or target sequence is determined by comparison of hybridization signal intensities of four probes, two of which have a segment showing perfect complementarity or a single base mismatch to the reference sequence, and two of which have a segment showing perfect complementarity or a double-base mismatch to a segment. The four probes whose hybridization patterns are to be compared each have a segment that is exactly complementary to a reference sequence except at two interrogation positions, in which the segment may or may not be complementary to the reference sequence. The interrogation positions correspond to the nucleotides in a reference or target sequence which are determined by the comparison of intensities. The nucleotides occupying the interrogation positions in the four probes are selected according to the following rule. The first interrogation position is occupied by a different nucleotide in each of the four probes. The second interrogation position is also occupied by a different nucleotide in each of the four probes. In two of the four probes, designated the first and second probes, the segment is exactly complementary to the reference sequence except at not more than one of the two interrogation positions. In other words, one of the interrogation positions is occupied by a nucleotide that is complementary to the corresponding nucleotide from the reference sequence and the other interrogation position may or may not be so occupied. In the other two of the four probes, designated the third and fourth probes, the segment is exactly complementary to the reference sequence except that both interrogation positions are occupied by nucleotides which are noncomplementary to the respective corresponding nucleotides in the reference sequence.

There are a number of ways of satisfying these conditions depending on whether the two nucleotides in the reference sequence corresponding to the two interrogation positions are the same or different. If these two nucleotides are different in the reference sequence (probability ¾), the conditions are satisfied by each of the two interrogation positions being occupied by the same nucleotide in any given probe. For example, in the first probe, the two interrogation positions would both be A, in the second probe, both would be C, in the third probe, each would be G, and in the fourth probe each would be T or U. If the two nucleotides in the reference sequence corresponding to the two interrogation positions are the same, the conditions noted above are satisfied by each of the interrogation positions in any one of the four probes being occupied by complementary nucleotides. For example, in the first probe, the interrogation positions could be occupied by A and T, in the second probe by C and G, in the third probe by G and C and in the fourth probe, by T and A.

When the four probes are hybridized to a target that is the same as the reference sequence or differs from the reference sequence at one (but not both) of the interrogation positions, two of the four probes show a double-mismatch with the target and two probes show a single mismatch. The identity of probes showing these different degrees of mismatch can be determined from the different hybridization signals.

From the identity of the probes showing the different degrees of mismatch, the nucleotides occupying both of the interrogation positions in the target sequence can be deduced.
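The selection rule for the two interrogation positions can be checked with a short Python sketch. The function names are hypothetical, only the two interrogation positions are modeled, and the probes are listed in A, C, G, T order rather than in the first-to-fourth designation used in the text. The sketch counts, for a given pair of reference nucleotides, how many interrogation positions in each probe fail to complement the reference.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def multiplex_probe_pairs(ref_pair):
    """Bases placed at the two interrogation positions of the four probes."""
    first_ref, second_ref = ref_pair
    if first_ref != second_ref:
        # Reference bases differ: both positions carry the same base.
        return [(b, b) for b in "ACGT"]
    # Reference bases are the same: the positions carry complementary bases.
    return [(b, COMPLEMENT[b]) for b in "ACGT"]

def mismatch_counts(ref_pair, probe_pairs):
    """Number of interrogation positions (0, 1 or 2) at which each probe
    fails to complement the reference."""
    wanted = tuple(COMPLEMENT[b] for b in ref_pair)
    return [sum(p != w for p, w in zip(pair, wanted)) for pair in probe_pairs]

for ref_pair in [("A", "C"), ("G", "G")]:
    pairs = multiplex_probe_pairs(ref_pair)
    print(ref_pair, pairs, mismatch_counts(ref_pair, pairs))
```

For either kind of reference pair, two probes come out with a single mismatch and two with a double mismatch, which is the pattern the comparison of signal intensities relies on.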

For ease of illustration, the multiplex strategy has been initially described for the situation where there are two nucleotides of interest in a reference sequence and only four probes in an array. Of course, the strategy can be extended to analyze any number of nucleotides in a target sequence by using additional probes. In one variation, each pair of interrogation positions is read from a unique group of four probes. In a block variation, different groups of four probes exhibit the same segment of complementarity with the reference sequence, but the interrogation positions move within a block.

The block and standard multiplex tiling variants can of course be used in combination for different regions of a reference sequence. Either or both variants can also be used in combination with any of the other tiling strategies described.

1.2.1.1.5.7. Helper Mutations

Occasionally small regions of a reference sequence give a low hybridization signal as a result of self-annealing of probes.

The self-annealing reduces the amount of probe effectively available for hybridizing to the target. Although such regions of the target are generally small and the reduction of hybridization signal is usually not so substantial as to obscure the sequence of this region, this concern can be avoided by the use of probes incorporating helper mutations.

The helper mutation(s) serve to break up regions of internal complementarity within a probe and thereby prevent self-annealing.

Usually, one or two helper mutations are quite sufficient for this purpose. The inclusion of helper mutations can be beneficial in any of the tiling strategies noted above. In general each probe having a particular interrogation position has the same helper mutation(s). Thus, such probes have a segment in common which shows perfect complementarity with a reference sequence, except that the segment contains at least one helper mutation (the same in each of the probes) and at least one interrogation position (different in all of the probes). For example, in the basic tiling strategy, a probe from the first probe set comprises a segment containing an interrogation position and showing perfect complementarity with a reference sequence except for one or two helper mutations. The corresponding probes from the second, third and fourth probe sets usually comprise the same segment (or sometimes a subsequence thereof including the helper mutation(s) and interrogation position), except that the base occupying the interrogation position varies in each probe.

Usually, the helper mutation tiling strategy is used in conjunction with one of the tiling strategies described above.

The probes containing helper mutations are used to tile regions of a reference sequence otherwise giving low hybridization signal (e.g., because of self-complementarity), and the alternative tiling strategy is used to tile intervening regions.

1.2.1.1.5.8. Pooling Strategies

Pooling strategies also employ arrays of immobilized probes. Probes are immobilized in cells of an array, and the hybridization signal of each cell can be determined independently of any other cell. A particular cell may be occupied by a pooled mixture of probes. Although the identity of each probe in the mixture is known, the individual probes in the pool are not separately addressable. Thus, the hybridization signal from a cell is the aggregate of that of the different probes occupying the cell. In general, a cell is scored as hybridizing to a target sequence if at least one probe occupying the cell comprises a segment exhibiting perfect complementarity to the target sequence.

A simple strategy to show the increased power of pooled strategies over a standard tiling is to create three cells each containing a pooled probe having a single pooled position, the pooled position being the same in each of the pooled probes. At the pooled position, there are two possible nucleotides, allowing the pooled probe to hybridize to two target sequences. In tiling terminology, the pooled position of each probe is an interrogation position. As will become apparent, comparison of the hybridization intensities of the pooled probes from the three cells reveals the identity of the nucleotide in the target sequence corresponding to the interrogation position (i.e., that is matched with the interrogation position when the target sequence and pooled probes are maximally aligned for complementarity).

The three cells are assigned probe pools that are perfectly complementary to the target except at the pooled position, which is occupied by a different pooled nucleotide in each probe.

With three pooled probes, all four possible single base pair states (wildtype and three mutants) are detected. A pool hybridizes with a target if some probe contained within that pool is complementary to that target.

A cell containing a pair (or more) of oligonucleotides lights up when a target complementary to any of the oligonucleotides in the cell is present. Using the simple strategy, each of the four possible targets (wildtype and three mutants) yields a unique hybridization pattern among the three cells.

Since a different pattern of hybridizing pools is obtained for each possible nucleotide in the target sequence corresponding to the pooled interrogation position in the probes, the identity of the nucleotide can be determined from the hybridization pattern of the pools. Whereas a standard tiling requires four cells to detect and identify the possible single-base substitutions at one location, this simple pooled strategy only requires three cells.
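The three-cell readout can be illustrated with a minimal Python sketch. The text does not name the pooled nucleotides used in the three cells; the choice of the IUPAC pools M (A/C), K (G/T) and R (A/G) below is an assumption made for illustration, as are the function names.

```python
# IUPAC two-base ambiguity codes, read here as the probe bases present at the
# pooled position of each cell. The particular triple is an assumption.
POOLS = {"M": "AC", "K": "GT", "R": "AG"}
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def lit_cells(target_base, cells=("M", "K", "R")):
    """A cell is scored positive when one of its pooled probe bases
    complements the target base at the interrogation position."""
    needed = COMPLEMENT[target_base]
    return tuple(needed in POOLS[cell] for cell in cells)

def identify(pattern, cells=("M", "K", "R")):
    """Decode the target base from the pattern of hybridizing cells."""
    matches = [base for base in "ACGT" if lit_cells(base, cells) == pattern]
    return matches[0] if len(matches) == 1 else None

for base in "ACGT":
    print(base, lit_cells(base))
```

With this assumed triple every target base lights at least one cell and no two target bases share a pattern, so three cells are enough to identify the nucleotide.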

A more efficient pooling strategy for sequence analysis is the ‘Trellis’ strategy. In this strategy, each pooled probe has a segment of perfect complementarity to a reference sequence except at three pooled positions. One pooled position is an N pool. The three pooled positions may or may not be contiguous in a probe. The other two pooled positions are selected from the group of three pools consisting of (1) M or K, (2) R or Y and (3) W or S, where the single letters are IUPAC standard ambiguity codes. The sequence of a pooled probe is thus of the form XXXN[(M/K) or (R/Y) or (W/S)][(M/K) or (R/Y) or (W/S)]XXXXX, where XXX represents bases complementary to the reference sequence. The three pooled positions may be in any order, and may be contiguous or separated by intervening nucleotides. For the two positions occupied by [(M/K) or (R/Y) or (W/S)], two choices must be made. First, one must select one of the following three pairs of pooled nucleotides: (1) M/K, (2) R/Y and (3) W/S. The pair selected may be the same or different at the two pooled positions. Second, supposing, for example, one selects M/K at one position, one must then choose between M and K. This choice should result in selection of a pooled nucleotide comprising a nucleotide that complements the corresponding nucleotide in a reference sequence, when the probe and reference sequence are maximally aligned. The same principle governs the selection between R and Y, and between W and S. A trellis pool probe has one pooled position with four possibilities, and two pooled positions, each with two possibilities. Thus, a trellis pool probe comprises a mixture of 16 (4×2×2) probes. Since each pooled position includes one nucleotide that complements the corresponding nucleotide from the reference sequence, one of these 16 probes has a segment that is the exact complement of the reference sequence. A target sequence that is the same as the reference sequence (i.e., a wildtype target) gives a hybridization signal to each probe cell. Here, as in other tiling methods, the segment of complementarity should be sufficiently long to permit specific hybridization of a pooled probe to a reference sequence to be detected relative to a variant of that reference sequence. Typically, the segment of complementarity is about 9-21 nucleotides.
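The composition of a trellis pool probe can be made concrete with a small Python sketch that expands a pooled probe written with IUPAC ambiguity codes into its member sequences. The 11-base sequences and the function name are hypothetical; the pooled positions are placed at indices 3, 4 and 5 purely for illustration.

```python
from itertools import product

IUPAC = {
    "A": "A", "C": "C", "G": "G", "T": "T",
    "M": "AC", "K": "GT", "R": "AG", "Y": "CT", "W": "AT", "S": "CG",
    "N": "ACGT",
}

def expand_pooled_probe(pooled_probe):
    """List every individual probe sequence contained in a pooled probe."""
    choices = [IUPAC[symbol] for symbol in pooled_probe]
    return ["".join(bases) for bases in product(*choices)]

# Hypothetical exact complement of a reference segment, and a trellis probe
# derived from it with pooled positions N, M and R.
exact_complement = "TGCACAGTCCA"
trellis_probe = "TGCNMRGTCCA"

members = expand_pooled_probe(trellis_probe)
print(len(members))
print(exact_complement in members)
```

Here M is chosen over K, and R over Y, because they include the bases found at those positions in the exact complement, so the mixture of 16 probes contains the exact complement of the reference.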

A target sequence is analyzed by comparing hybridization intensities at three pooled probes, each having the structure described above. The segments complementary to the reference sequence present in the three pooled probes show some overlap.

Sometimes the segments are identical (other than at the interrogation positions). However, this need not be the case.

For example, the segments can tile across a reference sequence in increments of one nucleotide (i.e., one pooled probe differs from the next by the acquisition of one nucleotide at the 5′ end and loss of a nucleotide at the 3′ end). The three interrogation positions may or may not occur at the same relative positions within each pooled probe (i.e., spacing from a probe terminus). All that is required is that one of the three interrogation positions from each of the three pooled probes aligns with the same nucleotide in the reference sequence, and that this interrogation position is occupied by a different pooled nucleotide in each of the three probes. In one of the three probes, the interrogation position is occupied by an N. In the other two pooled probes the interrogation position is occupied by one of (M/K) or (R/Y) or (W/S).

In the simplest form of the trellis strategy, three pooled probes are used to analyze a single nucleotide in the reference sequence. Much greater economy of probes is achieved when more pooled probes are included in an array.

For example, consider an array of five pooled probes each having the general structure outlined above. Three of these pooled probes have an interrogation position that aligns with the same nucleotide in the reference sequence and are used to read that nucleotide. A different combination of three probes have an interrogation position that aligns with a different nucleotide in the reference sequence. Comparison of these three probe intensities allows analysis of this second nucleotide. Still another combination of three pooled probes from the set of five have an interrogation position that aligns with a third nucleotide in the reference sequence and these probes are used to analyze that nucleotide. Thus, three nucleotides in the reference sequence are fully analyzed from only five pooled probes. By comparison, the basic tiling strategy would require 12 probes for a similar analysis.

The trellis strategy employs an array of probes having at least three cells, each of which is occupied by a pooled probe as described above.

Consider the use of three such pooled probes for analyzing a target sequence, of which one position may contain any single base substitution to the reference sequence (i.e., there are four possible target sequences to be distinguished).

Three cells are occupied by pooled probes having a pooled interrogation position corresponding to the position of possible substitution in the target sequence, one cell with an ‘N’, one cell with one of ‘M’ or ‘K’, and one cell with ‘R’ or ‘Y’. An interrogation position corresponds to a nucleotide in the target sequence if it aligns with that nucleotide when the probe and target sequence are aligned to maximize complementarity. Note that although each of the pooled probes has two other pooled positions, these positions are not relevant for the present illustration. The positions are only relevant when more than one position in the target sequence is to be read, a circumstance that will be considered later. For present purposes, the cell with the ‘N’ in the interrogation position lights up for the wildtype sequence and any of the three single base substitutions of the target sequence.
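The reading of one position from the three trellis cells can be sketched in Python as follows. The function names are hypothetical, and bases are expressed as the probe base required to pair with the target at the interrogation position. The sketch selects M or K, and R or Y, so that each cell includes the wildtype base, and then lists which cells light up for each possible target.

```python
POOLS = {"N": "ACGT", "M": "AC", "K": "GT", "R": "AG", "Y": "CT"}

def cells_for(wildtype_probe_base):
    """Pick the pooled nucleotide for each of the three cells so that every
    cell includes the probe base matching the wildtype sequence."""
    second = "M" if wildtype_probe_base in POOLS["M"] else "K"
    third = "R" if wildtype_probe_base in POOLS["R"] else "Y"
    return ("N", second, third)

def lit_cells(probe_base_needed, cells):
    """Cells that hybridize to a target requiring the given probe base."""
    return tuple(probe_base_needed in POOLS[cell] for cell in cells)

cells = cells_for("C")
for base in "ACGT":
    print(base, lit_cells(base, cells))
```

In this sketch the ‘N’ cell lights up for all four targets, the wildtype lights all three cells, and each substitution leaves a different subset of the other two cells dark.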

A further class of strategies involving pooled probes is termed coding strategies. These strategies assign code words from some set of numbers to variants of a reference sequence.

Any number of variants can be coded. The variants can include multiple closely spaced substitutions, deletions or insertions. The designation letters or other symbols assigned to each variant may be any arbitrary set of numbers, in any order. For example, a binary code is often used, but codes to other bases are entirely feasible. The numbers are often assigned such that each variant has a designation having at least one digit and at least one nonzero value for that digit.

For example, in a binary system, a variant assigned the number 101 has a designation of three digits, with one possible nonzero value for each digit.

The designations of the variants are coded into an array of pooled probes comprising a pooled probe for each nonzero value of each digit in the numbers assigned to the variants.

For example, if the variants are assigned successive numbers in a numbering system of base m, and the highest number assigned to a variant has n digits, the array would have about n×(m−1) pooled probes. In general, log_m(3N+1) probes are required to analyze all variants of N locations in a reference sequence, each having three possible mutant substitutions.

For example, 10 base pairs of sequence may be analyzed with only 5 pooled probes using a binary coding system.
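The probe counts quoted above can be reproduced with a short Python sketch. The function names are hypothetical, and reading the extra 1 in 3N+1 as the code reserved for the reference sequence is an assumption. The number of digits is found with integer arithmetic rather than a floating-point logarithm.

```python
def digits_needed(num_positions, base=2):
    """Smallest n with base**n >= 3*num_positions + 1, i.e. the ceiling of
    log_base(3N + 1), computed with integers to avoid rounding error."""
    # Three mutant substitutions per position, plus one further code
    # (assumed here to stand for the reference sequence).
    codes_required = 3 * num_positions + 1
    n = 0
    while base ** n < codes_required:
        n += 1
    return n

def pooled_probes_needed(num_positions, base=2):
    """About n x (m - 1) pooled probes: one per nonzero value of each digit."""
    return digits_needed(num_positions, base) * (base - 1)

print(pooled_probes_needed(10))          # binary code, 10 positions
print(pooled_probes_needed(10, base=4))  # base-4 code, 10 positions
```

For 10 positions the binary case needs 31 codes, which fit in five binary digits, in agreement with the five pooled probes stated above.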

Each pooled probe has a segment exactly complementary to the reference sequence except that certain positions are pooled.

The segment should be sufficiently long to allow specific hybridization of the pooled probe to the reference sequence relative to a mutated form of the reference sequence. As in other tiling strategies, segment lengths of 9-21 nucleotides are typical. Often the probe has no nucleotides other than the 9-21 nucleotide segment. The pooled positions comprise nucleotides that allow the pooled probe to hybridize to every variant assigned a particular nonzero value in a particular digit. Usually, the pooled positions further comprise a nucleotide that allows the pooled probe to hybridize to the reference sequence. Thus, a wildtype target (or reference sequence) is immediately recognizable from all the pooled probes being lit.

When a target is hybridized to the pools, only those pools comprising a component probe having a segment that is exactly complementary to the target light up. The identity of the target is then decoded from the pattern of hybridizing pools. Each pool that lights up is correlated with a particular value in a particular digit. Thus, the aggregate hybridization patterns of each lighting pool reveal the value of each digit in the code defining the identity of the target hybridized to the array.
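The encoding and decoding steps can be sketched in Python for a binary code. This is a simplified model under stated assumptions: variants are represented by their assigned numbers rather than by sequences, the numbering 1 to 30 is arbitrary, and the function names are hypothetical. The reference sequence is placed in every pool so that it is recognized by all pools being lit.

```python
def build_pools(num_variants, num_digits):
    """Binary coding sketch: variant k (numbered from 1) joins pool d when
    bit d of k is 1. The reference sequence is added to every pool."""
    pools = []
    for d in range(num_digits):
        members = {k for k in range(1, num_variants + 1) if (k >> d) & 1}
        members.add("reference")
        pools.append(members)
    return pools

def lit_pattern(target, pools):
    """A pool lights up when it contains a probe matching the target."""
    return tuple(target in pool for pool in pools)

def decode(pattern):
    """Recover the target identity from the pattern of hybridizing pools."""
    if all(pattern):
        return "reference"
    return sum(1 << d for d, lit in enumerate(pattern) if lit)

# 10 positions x 3 substitutions = 30 variants, coded in 5 binary digits.
pools = build_pools(num_variants=30, num_digits=5)
print(lit_pattern(5, pools), decode(lit_pattern(5, pools)))
```

Because no variant number from 1 to 30 has all five bits set, the all-lit pattern remains free for the reference sequence in this sketch.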

1.2.1.1.5.9. Bridging Strategy

Probes that contain partial matches to two separate (i.e., noncontiguous) subsequences of a target sequence sometimes hybridize strongly to the target sequence. In certain instances, such probes have generated stronger signals than probes of the same length which are perfect matches to the target sequence. It is believed (but not necessary to the invention) that this observation results from interactions of a single target sequence with two or more probes simultaneously. This invention exploits this observation to provide arrays of probes having at least first and second segments, which are respectively complementary to first and second subsequences of a reference sequence. Optionally, the probes may have a third or more complementary segments. These probes can be employed in any of the strategies noted above.

The two segments of such a probe can be complementary to disjoint subsequences of the reference sequence or to contiguous subsequences. If the latter, the two segments in the probe are inverted relative to the order of the complement of the reference sequence. The two subsequences of the reference sequence each typically comprise about 3 to 30 contiguous nucleotides. The subsequences of the reference sequence are sometimes separated by 0, 1, 2 or 3 bases. Often the subsequences are adjacent and nonoverlapping.

The bridging strategy offers the following advantages:

(1) Higher discrimination between matched and mismatched probes.

(2) The possibility of using longer probes in a bridging tiling, thereby increasing the specificity of the hybridization, without sacrificing discrimination.

(3) The use of probes in which an interrogation position is located very off-center relative to the regions of target complementarity. This may be of particular advantage when, for example, a probe centered about one region of the target gives low hybridization signal. The low signal is overcome by using a probe centered about an adjoining region giving a higher hybridization signal.

(4) Disruption of secondary structure that might result in annealing of certain probes (see previous discussion of helper mutations).

1.2.1.1.5.10. Deletion Tiling

Deletion tiling is related to both the bridging and helper mutation strategies described above. In the deletion strategy, comparisons are performed between probes sharing a common deletion but differing from each other at an interrogation position located outside the deletion. For example, a first probe comprises first and second segments, each exactly complementary to respective first and second subsequences of a reference sequence, wherein the first and second subsequences of the reference sequence are separated by a short distance (e.g., 1 or 2 nucleotides). The order of the first and second segments in the probe is usually the same as that of the complement to the first and second subsequences in the reference sequence.
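The construction of a set of deletion probes can be sketched in Python as follows. The reference sequence, the function names and the parameter values are hypothetical, and the sketch works with base-by-base complements written in the same left-to-right order as the reference, so strand polarity is not modeled.

```python
COMPLEMENT = {"A": "T", "C": "G", "G": "C", "T": "A"}

def complement(seq):
    """Base-by-base complement, kept in the same order as the reference."""
    return "".join(COMPLEMENT[b] for b in seq)

def deletion_probe(reference, start, first_len, gap, second_len):
    """Join the complements of two reference subsequences separated by `gap`
    nucleotides, omitting the gap from the probe."""
    first = reference[start:start + first_len]
    second_start = start + first_len + gap
    second = reference[second_start:second_start + second_len]
    return complement(first) + complement(second)

def deletion_probe_set(reference, start, first_len, gap, second_len, interrogation):
    """Four probes sharing the deletion and differing only at one
    interrogation position, given as an index into the probe."""
    base_probe = deletion_probe(reference, start, first_len, gap, second_len)
    return [
        base_probe[:interrogation] + b + base_probe[interrogation + 1:]
        for b in "ACGT"
    ]

reference = "ACGTTGCAAGGCTTACG"
probes = deletion_probe_set(reference, start=2, first_len=6, gap=1,
                            second_len=6, interrogation=3)
for probe in probes:
    print(probe)
```

The two segments keep the same order as the complement of the reference, and the single skipped reference nucleotide is the deletion common to all four probes.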

Such tilings sometimes offer superior discrimination in hybridization intensities between the probe having an interrogation position complementary to the target and other probes. Thermodynamically, the difference between the hybridizations to matched and mismatched targets for the probe set shown above is the difference between a single-base bulge, and a large asymmetric loop (e.g., two bases of target, one of probe). This often results in a larger difference in stability than the comparison of a perfectly matched probe with a probe showing a single base mismatch in the basic tiling strategy.

The use of deletion or bridging probes is quite general. These probes can be used in any of the tiling strategies of the invention. As well as offering superior discrimination, the use of deletion or bridging strategies is advantageous for certain probes to avoid self-hybridization (either within a probe or between two probes of the same sequence).

1.2.1.1.6. Preparation of Target Samples

The target polynucleotide, whose sequence is to be determined, is usually isolated from a tissue sample. If the target is genomic, the sample may be from any tissue (except exclusively red blood cells). For example, whole blood, peripheral blood lymphocytes or PBMC, skin, hair or semen are convenient sources of clinical samples. These sources are also suitable if the target is RNA. Blood and other body fluids are also a convenient source for isolating viral nucleic acids. If the target is mRNA, the sample is obtained from a tissue in which the mRNA is expressed. If the polynucleotide in the sample is RNA, it is usually reverse transcribed to DNA. DNA samples or cDNA resulting from reverse transcription are usually amplified, e.g., by PCR. Depending on the selection of primers and amplifying enzyme(s), the amplification product can be RNA or DNA.

Paired primers are selected to flank the borders of a target polynucleotide of interest. More than one target can be simultaneously amplified by multiplex PCR in which multiple paired primers are employed. The target can be labelled at one or more nucleotides during or after amplification. For some target polynucleotides (depending on size of sample), e.g., episomal DNA, sufficient DNA is present in the tissue sample to dispense with the amplification step.

When the target strand is prepared in single-stranded form as in preparation of target RNA, the sense of the strand should of course be complementary to that of the probes on the chip. This is achieved by appropriate selection of primers.

The target is preferably fragmented before application to the chip to reduce or eliminate the formation of secondary structures in the target. The average size of target segments following hybridization is usually larger than the size of probe on the chip.

1.2.1.2. Sequencing

This invention provides that the method of performing whole cell engineering may comprise the step of cell screening. In a preferred embodiment, this invention provides that the step of cell screening may comprise the step of genomic sequencing. In one exemplification, genome sequencing can be accomplished according to the enzymatic/Sanger method (described in F. Sanger, S. Nicklen, and A. R. Coulson, Proc. Natl. Acad. Sci. USA, 74: 5463-5467 (1977)) and involve cloning and subcloning (described in U.S. Pat. No. 4,725,677; Chen and Seeburg, DNA 4, 165-170 (1985); Lim et al., Gene Anal. Techn. 5, 32-39 (1988); PCR Protocols: A Guide to Methods and Applications, Innis et al., editors, Academic Press, San Diego (1990); Innis et al., Proc. Natl. Acad. Sci. USA 85, 9436-9440 (1988)).

In another exemplification, sequencing can be accomplished according to the chemical/Maxam and Gilbert method which is described in references: A. M. Maxam and W. Gilbert, Proc. Natl. Acad. Sci. USA, 74: 560-564 (1977) and Church et al., Proc. Natl. Acad. Sci. USA, 81: 1991 (1984). In additional exemplifications, genome sequencing can be accomplished by methodology described by Guo and Wu (Guo and Wu, Nucleic Acids Res., 10: 2065 (1982); and Meth. Enz., 100: 60 (1983)) or those methods that utilize 3′-hydroxy-protected and labeled nucleotides as exemplified in the following references: Churchich, J. E., Eur. J. Biochem., 231: 736 (1995); Metzker, M. L. et al., Nucleic Acids Research, 22: 4259 (1994); Beabealashvilli, R. S. et al., Biochimica et Biophysica Acta, 868: 136 (1986); Chidgeavadze, Z. G.; Kukhanova, M. K. et al., Biochimica et Biophysica Acta, 868: 145 (1986); Hiratsuka, T., Biochimica et Biophysica Acta, 742: 496 (1983); Jeng, S. J. and Guillory, R. J., J. Supramolecular Structure, 3: 448 (1975).

The invention also provides that sequencing may be read by autoradiography using radioisotopes (as described in Ornstein et al., Biotechniques 2, 476 (1985)) or by using non-radioactive labeling strategies that have been integrated into partly automated DNA sequencing procedures (Smith et al., Nature 321, 674-679 (1986) and EPO Patent No. 873 00998.9; Du Pont De Nemours EPO Application No. 03 59225; Ansorge et al., J. Biochem. Biophys. Methods 13, 325-32 (1986); Prober et al., Science 238, 336-341 (1987); Applied Biosystems, PCT Application WO 91/05060; Smith et al., Science 235, G89 (1987); U.S. Pat. Nos. 570,973 and 689,013), Du Pont De Nemours, U.S. Pat. Nos. 881,372 and 57,566, Ansorge et al., Nucleic Acids Res. 15, 4593-4602 (1987) and EMBL Patent Application DE P3724442 and P3805808.1) and Hitachi (JP 1-90844 and DE 4011991 A1; U.S. Pat. No. 4,729,947; PCT Application WO92/02635; U.S. Pat. No. 594,676; Beck, O'Keefe, Coull and Köster, Nucleic Acids Res. 17, 5115-5123 (1989) and Beck and Köster, Anal. Chem. 62, 2258-2270 (1990); Church et al., Science 240, 185-188 (1988); Köster et al., Nucleic Acids Res. Symposium Ser. No. 24, 318-321 (1991), University of Utah, PCT Application No. WO 90/15883; Smith et al., Nature (1986) 321: 674-679; Orion-Yhtyma Oy, U.S. Pat. No. 277,643; M. Uhlen et al., Nucleic Acids Res. 16, 3025-38 (1988); Cemu Bioteknik, PCT Application No. WO 89/09282 and Medical Research Council, GB, PCT Application No. WO 92/03575; Du Pont De Nemours, PCT Application WO 91/11533).

In addition, this invention provides for various methods of reading sequencing data such as capillary zone electrophoresis (described in Jorgenson et al., J. Chromatography 352, 337 (1986); Gesteland et al., Nucleic Acids Res. 18, 1415-1419 (1990)), mass spectrometry (including ES [described in Fenn et al., J. Phys. Chem. 18, 4451-59 (1984); PCT Application No. WO 90/14148; R. D. Smith et al., Anal. Chem. 62, 882-89 (1990) and B. Ardrey, Electrospray Mass Spectrometry, Spectroscopy Europe 4, 10-18 (1992)] and MALDI [Hillenkamp et al., Matrix Assisted UV-Laser Desorption/Ionization: A New Approach to Mass Spectrometry of Large Biomolecules, Biological Mass Spectrometry (Burlingame and McCloskey, editors), Elsevier Science Publishers, Amsterdam, pp. 49-60 (1990); Williams et al., Science, 246, 1585-87 (1989); Williams et al., Rapid Communications in Mass Spectrometry, 4, 348-351 (1990)]), tube gel electrophoresis and a mass analyzer to sequence (described in EPO Patent Applications No. 0360676 A1 and 0360677). In order to analyze the sequencing data, this invention provides for the use of probes in large arrays (as described in PCT patent Publication No. 92/10588; U.S. Pat. No. 5,143,854; U.S. application Ser. No. 07/805,727; U.S. Pat. No. 5,202,231; PCT patent Publication No. 89/10977).

This invention provides that the method of performing whole cell engineering may comprise the step of cell screening, which in a particular embodiment may include the method of DNA amplification. In a particular embodiment, this invention provides that DNA can be amplified by a variety of procedures including cloning (Sambrook et al., Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Laboratory Press, 1989), polymerase chain reaction (PCR) (C. R. Newton and A. Graham, PCR, BIOS Publishers, 1994; Bevan et al., “Sequencing of PCR-Amplified DNA” PCR Meth. App. 4: 222 (1992)), ligase chain reaction (LCR) (F. Barany, Proc. Natl. Acad. Sci. USA 88, 189-93 (1991)), strand displacement amplification (SDA) (G. Terrance Walker et al., Nucleic Acids Res. 22, 2670-77 (1994)) and variations such as RT-PCR (Arens, M., Clin Microbiol Rev, 12(4): 612-26 (1999)) and allele-specific amplification (ASA) (Nichols, W. C. et al., Genomics, October; 5(3): 535-40 (1989); Giffard, P. M. et al., Anal Biochem, 292(2): 207-15 (2001)).

Additional embodiments of this invention provide for additional sequencing methods (as described in Labeit et al., DNA 5, 173-177 (1986); Amersham, PCT-Application GB86/00349; Eckstein et al., Nucleic Acids Res. 16, 9947 (1988); Max-Planck-Gesellschaft, DE 3930312 A1; Saiki, R. et al., Science 239: 487-491 (1988); Sarkar, G. and Bolander, Mark E., Semi Exponential Cycle Sequencing, Nucleic Acids Research, 1995, Vol. 23, No. 7, p. 1269-1270).

This invention also provides for the following sequencing strategies: shotgun sequencing, transposon-mediated directed sequencing (Strathmann, M. et al., Proc Natl Acad Sci USA (1991) 88: 1247-1250), and large scale variations thereof (as exemplified in K. B. Mullis et al., U.S. Pat. No. 4,683,202; July/1987; 435/91; and U.S. Pat. No. 4,683,195, July/1987; 435/6).

According to alternative embodiments of this invention, the step of genomic sequencing may include constructing ordered clone maps of DNA sequencing (as described in sections of U.S. Pat. No. 5,604,100 and PCT Patent Publication No. WO9627025). This invention provides that the method of genome sequencing be achieved by various steps that may utilize modifications of certain methods mentioned above (described in the following patents: PCT Publication Nos. WO9737041, WO9742348, WO9627025, WO9831834, WO9500530, and WO9831833; U.S. Pat. No. 5,604,100, U.S. Pat. No. 5,670,321, U.S. Pat. No. 5,453,247, U.S. Pat. No. 5,994,058, and U.S. Pat. No. 5,354,656).

1.2.1.3. Annotating

In one aspect this invention discloses the use of a relational database system for storing and manipulating biomolecular sequence information and storing and displaying genetic information, the database including genomic libraries for a plurality of types of organisms, the libraries having multiple genomic sequences, at least some of which represent open reading frames located along a contiguous sequence on each of the plurality of organisms' genomes, and a user interface capable of receiving a selection of two or more of the genomic libraries for comparison and displaying the results of the comparison. Associated with the database is a software system that allows a user to determine the relative position of a selected gene sequence within a genome. The system allows execution of a method of displaying the genetic locus of a biomolecular sequence. The method involves providing a database including multiple biomolecular sequences, at least some of which represent open reading frames located along a contiguous sequence on an organism's genome. The system also provides a user interface capable of receiving a selection of one or more probe open reading frames for use in determining homologous matches between such probe open reading frame(s) and the open reading frames in the genomic libraries, and displaying the results of the determination. An open reading frame for the sequence is selected and displayed together with adjacent open reading frames located upstream and downstream in the relative positions in which they occur on the contiguous sequence.
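The retrieval of a selected open reading frame together with its upstream and downstream neighbors can be sketched with an in-memory SQLite database in Python. The table layout, column names, identifiers and coordinates below are all hypothetical; the text does not specify a schema, so this is only one way such a lookup might be arranged.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    """CREATE TABLE orf (
           orf_id    TEXT PRIMARY KEY,
           organism  TEXT NOT NULL,
           contig    TEXT NOT NULL,
           start_pos INTEGER NOT NULL,
           end_pos   INTEGER NOT NULL
       )"""
)
conn.executemany(
    "INSERT INTO orf VALUES (?, ?, ?, ?, ?)",
    [
        ("orfA", "organism_1", "contig_1", 100, 900),
        ("orfB", "organism_1", "contig_1", 950, 2100),
        ("orfC", "organism_1", "contig_1", 2200, 3000),
        ("orfD", "organism_1", "contig_1", 3100, 4500),
    ],
)

def neighborhood(conn, orf_id, flank=1):
    """Return the selected ORF with up to `flank` ORFs on either side, in the
    order in which they occur along the contiguous sequence."""
    organism, contig, start = conn.execute(
        "SELECT organism, contig, start_pos FROM orf WHERE orf_id = ?", (orf_id,)
    ).fetchone()
    upstream = conn.execute(
        "SELECT orf_id FROM orf WHERE organism = ? AND contig = ? AND start_pos < ? "
        "ORDER BY start_pos DESC LIMIT ?",
        (organism, contig, start, flank),
    ).fetchall()
    downstream = conn.execute(
        "SELECT orf_id FROM orf WHERE organism = ? AND contig = ? AND start_pos > ? "
        "ORDER BY start_pos ASC LIMIT ?",
        (organism, contig, start, flank),
    ).fetchall()
    return [row[0] for row in reversed(upstream)] + [orf_id] + [row[0] for row in downstream]

print(neighborhood(conn, "orfB"))
```

Ordering by start position within one organism and contig is what lets the neighbors be shown in the relative positions in which they occur.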

Also disclosed is a relational database system for storing biomolecular sequence information in a manner that allows sequences to be catalogued and searched according to one or more protein function hierarchies. The hierarchies allow searches for sequences based upon a protein's biological function or molecular function. Also disclosed is a mechanism for automatically grouping new sequences into protein function hierarchies. This mechanism uses descriptive information obtained from “external hits” which are matches of stored sequences against gene sequences stored in an external database such as GenBank. The descriptive information provided with the external database is evaluated according to a specific algorithm and used to automatically group the external hits (or the sequences associated with the hits) in the categories. Ultimately, the biomolecular sequences stored in databases of this invention are provided with both descriptive information from the external hit and category information from a relevant hierarchy or hierarchies.

Disclosed is a relational database system for storing biomolecular sequence information in a manner that allows sequences to be catalogued and searched according to association with one or more projects for obtaining full-length biomolecular sequences from shorter sequences. The relational database has sequence records containing information identifying one or more projects to which each of the sequence records belongs. Each project groups together one or more biomolecular sequences generated during work to obtain a full-length gene sequence from a shorter sequence. The computer system has a user interface allowing a user to selectively view information regarding one or more projects. The relational database also provides interfaces and methods for accessing, manipulating, and analyzing project-based information.

Polymer sequences are assembled into bins. A first number of bins are populated with polymer sequences. The polymer sequences in each bin are assembled into one or more consensus sequences representative of the polymer sequences of the bin. The consensus sequences of the bins are compared to determine relationships, if any, between the consensus sequences of the bins. The bins are modified based on the relationships between the consensus sequences of the bins. The polymer sequences are reassembled in the modified bins to generate one or more modified consensus sequences for each bin representative of the modified bins. In another aspect of the invention, sequence similarities and dissimilarities are analyzed in a set of polymer sequences. Pairwise alignment data is generated for pairs of the polymer sequences. The pairwise alignment data defines regions of similarity between the pairs of polymer sequences with boundaries. Additional boundaries in particular polymer sequences are determined by applying at least one boundary from at least one pairwise alignment for one pair of polymer sequences to at least one other pairwise alignment for another pair of polymer sequences including one of the particular polymer sequences. Additional regions of similarity are generated based on the boundaries.
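By way of illustration only, the step of assembling the sequences of a bin into a consensus sequence might be sketched as follows. This sketch assumes that the sequences in a bin have already been aligned to equal length and uses a simple per-position majority vote; the specification does not prescribe this particular rule, and the function name `consensus` is hypothetical.

```python
from collections import Counter


def consensus(aligned_seqs):
    """Majority-vote consensus for equal-length, pre-aligned sequences.

    Assumption: alignment has been done elsewhere; this is not the
    assembly algorithm of the specification, only a stand-in.
    """
    if not aligned_seqs:
        raise ValueError("bin is empty")
    length = len(aligned_seqs[0])
    if any(len(s) != length for s in aligned_seqs):
        raise ValueError("sequences must be pre-aligned to equal length")
    result = []
    for column in zip(*aligned_seqs):
        # Take the most frequent residue in this alignment column.
        result.append(Counter(column).most_common(1)[0][0])
    return "".join(result)


# One bin holding three aligned reads that differ at single positions.
bin_sequences = ["ACGTAC", "ACGTTC", "ACCTAC"]
bin_consensus = consensus(bin_sequences)
print(bin_consensus)
```

Consensus sequences computed this way for several bins could then be compared with one another to decide whether bins should be merged or split, as described above.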

1.2.1.3.1. Annotating—General Methodology

In one aspect the present invention relates generally to relational databases for storing and retrieving biological information. More particularly the invention relates to systems and methods for providing sequences of biological molecules in a relational format allowing retrieval in a client-server environment and for providing full-length cDNA sequences in a relational format allowing retrieval in a client-server environment.

Informatics is the study and application of computer and statistical techniques to the management of information. In genome projects, bioinformatics includes the development of methods to search databases quickly, to analyze nucleic acid sequence information, and to predict protein sequence, structure and function from DNA sequence data.

Increasingly, molecular biology is shifting from the laboratory bench to the computer desktop. Today's researchers require advanced quantitative analyses, database comparisons, and computational algorithms to explore the relationships between sequence and phenotype. Thus, by all accounts, researchers cannot and will not be able to avoid using computer resources to explore gene expression, gene sequencing and molecular structure.

One use of bioinformatics involves studying an organism's genome to determine the sequence and placement of its genes and their relationship to other sequences and genes within the genome or to genes in other organisms. Another use of bioinformatics involves studying genes differentially or commonly expressed in different tissues or cell lines (e.g. normal and cancerous tissue).

Such information is of significant interest in biomedical and pharmaceutical research, for instance to assist in the evaluation of drug efficacy and resistance.

The sequence tag method involves generation of a large number (e.g., thousands) of Expressed Sequence Tags (“ESTs”) from cDNA libraries (each produced from a different tissue or sample). ESTs are partial transcript sequences that may cover different parts of the cDNA(s) of a gene, depending on cloning and sequencing strategy. Each EST includes about 50 to 300 nucleotides. If it is assumed that the number of tags is proportional to the abundance of transcripts in the tissue or cell type used to make the cDNA library, then any variation in the relative frequency of those tags, stored in computer databases, can be used to detect the differential abundance and potentially the expression of the corresponding genes.
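As a minimal sketch of the tag-frequency comparison just described, assuming each EST has already been assigned to a gene identifier, the relative frequency of a gene's tags in two libraries may be compared as a simple ratio. The function names and the gene labels are hypothetical, and no statistical test is applied here.

```python
from collections import Counter


def relative_frequencies(tag_gene_ids):
    """Map each gene to its share of the tags in one cDNA library."""
    counts = Counter(tag_gene_ids)
    total = sum(counts.values())
    return {gene: n / total for gene, n in counts.items()}


def fold_change(lib_a, lib_b, gene):
    """Ratio of a gene's relative tag frequency in lib_a versus lib_b."""
    fa = relative_frequencies(lib_a).get(gene, 0.0)
    fb = relative_frequencies(lib_b).get(gene, 0.0)
    if fb == 0.0:
        # Gene absent from the reference library: ratio is undefined
        # (nan) or unbounded (inf), depending on whether lib_a has it.
        return float("inf") if fa > 0 else float("nan")
    return fa / fb


# Hypothetical libraries: one list entry per sequenced tag.
normal = ["geneA", "geneA", "geneB", "geneC"]
tumor = ["geneA", "geneB", "geneB", "geneB"]

print(fold_change(tumor, normal, "geneB"))
```

In practice tag counts are small for most genes, so a ratio such as this would ordinarily be accompanied by a significance estimate before any conclusion about differential expression is drawn.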

To make genomic and EST information manipulation easy to perform and understand, sophisticated computer database systems have been developed. In one database system, developed by Incyte Pharmaceuticals, Inc. of Palo Alto, Calif., genomic sequence data and the abundance levels of mRNA species represented in a given sample are electronically recorded and annotated with information available from public sequence databases. Examples of such databases include GenBank (NCBI) and TIGR. The resulting information is stored in a relational database that may be employed to determine relationships between sequences and genes within and among genomes, to establish a cDNA profile for a given tissue, and to evaluate changes in gene expression caused by disease progression, pharmacological treatment, aging, etc.

In one database system, developed by Incyte Pharmaceuticals, Inc. of Palo Alto, Calif., abundance levels of mRNA species represented in a given sample are electronically recorded and annotated with information available from public sequence databases such as GenBank. The resulting information is stored in a relational database that may be employed to establish a cDNA profile for a given tissue and to evaluate changes in gene expression caused by disease progression, pharmacological treatment, aging, etc.

Genetic information for a number of organisms has been catalogued in computer databases. Genetic databases for organisms such as Escherichia coli, Haemophilus influenzae, Mycoplasma genitalium, and Mycoplasma pneumoniae, among others, are publicly available. At present, however, complete sequence data is available for relatively few species, and the ability to manipulate sequence data within and between species and databases is limited.

While genetic data processing and relational database systems such as those developed by Incyte Pharmaceuticals, Inc. provide great power and flexibility in analyzing genetic information and gene expression information, this area of technology is still in its infancy and further improvements in genetic data processing and relational database systems and their content will help accelerate biological research for numerous applications.

In genome projects, bioinformatics includes the development of methods to search databases quickly, to analyze nucleic acid sequence information, and to predict protein sequence and structure from DNA sequence data. Increasingly, molecular biology is shifting from the laboratory bench to the computer desktop. Advanced quantitative analyses, database comparisons, and computational algorithms are needed to explore the relationships between sequence and phenotype.

1.2.1.3.2. Annotating—Exemplary Aspects

The annotation methods of this invention include those described in PCT patent publication Nos. 98/26407, 98/26408, and 99/49403 and U.S. Pat. Nos. 6,023,659 and 5,953,727 and are herein incorporated by reference in their entirety to the same extent as if each individual patent or patent application were specifically and individually indicated to be incorporated by reference in its entirety.

Thus, in one aspect, the present invention provides relational database systems for storing and analyzing biomolecular sequence information together with biological annotations detailing the source and interpretation of the sequence data. The present invention provides a powerful database tool for drug development and other research and development purposes.

The present invention provides relational database systems for storing and analyzing biomolecular sequence information together with biological annotations detailing the source and interpretation of the sequence data. Disclosed is a relational database system for storing and displaying genetic information.

Associated with the database is a software system that allows a user to determine the relative position of a selected gene sequence within a genome. The system allows execution of a method of displaying the genetic locus of a biomolecular sequence. The method involves providing a database including multiple biomolecular sequences, at least some of which represent open reading frames located along a contiguous sequence on an organism's genome. An open reading frame for the sequence is selected and displayed together with adjacent open reading frames located upstream and downstream in the relative positions in which they occur on the contiguous sequence.

The invention provides a method of displaying the genetic locus of a biomolecular sequence. The method involves providing a database including multiple biomolecular sequences, at least some of which represent open reading frames located along a contiguous sequence on an organism's genome. The method further involves identifying a selected open reading frame, and displaying the selected open reading frame together with adjacent open reading frames located upstream and downstream from the selected open reading frame.

The adjacent open reading frames and the selected open reading frame are displayed in the relative positions in which they occur on the contiguous sequence, textually and/or graphically. The method of the invention may be practiced with sequences from microbial organisms, and the sequences may include nucleic acid or protein sequences.
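The locus display described above may be illustrated, purely as a sketch, by ordering the open reading frames of a contiguous sequence by start coordinate and returning the selected frame with its immediate neighbors. The `Orf` record, the `locus_view` function, the `flank` parameter, and the coordinates are all hypothetical and are not taken from the specification.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Orf:
    name: str
    start: int  # coordinate on the contiguous sequence
    end: int


def locus_view(orfs, selected_name, flank=1):
    """Return the selected ORF with up to `flank` neighbors on each side,
    in the order in which they occur along the contiguous sequence."""
    ordered = sorted(orfs, key=lambda o: o.start)
    names = [o.name for o in ordered]
    i = names.index(selected_name)  # raises ValueError if not present
    lo = max(0, i - flank)
    return ordered[lo:i + flank + 1]


# ORFs supplied in arbitrary order, as they might come from a query.
contig = [
    Orf("orf3", 2100, 2900),
    Orf("orf1", 100, 900),
    Orf("orf2", 1000, 2000),
    Orf("orf4", 3000, 3600),
]

view = locus_view(contig, "orf2")
print(" | ".join(f"{o.name}[{o.start}-{o.end}]" for o in view))
```

A scrolling command of a given direction and magnitude could be handled in the same sketch by shifting the selected index before calling `locus_view` again.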

The invention also provides a computer system including a database having multiple biomolecular sequences, at least some of which represent open reading frames located along a contiguous sequence on an organism's genome.

The computer system also includes a user interface capable of identifying a selected open reading frame, and displaying the selected open reading frame together with adjacent open reading frames located upstream and downstream from the selected open reading frame. The adjacent open reading frames and the selected open reading frame are displayed in the relative positions in which they occur on the contiguous sequence. The user interface may also be capable of detecting a scrolling command, and based upon the direction and magnitude of the scrolling command, identifying a new selected open reading frame from the contiguous sequence.

The invention further provides a computer program product comprising a computer-usable medium having computer-readable program code embodied thereon relating to a database including multiple biomolecular sequences, at least some of which represent open reading frames located along a contiguous sequence on an organism's genome. The computer program product includes computer-readable program code for identifying a selected open reading frame, and displaying the selected open reading frame together with adjacent open reading frames located upstream and downstream from the selected open reading frame. The adjacent open reading frames and the selected open reading frame are displayed in the relative positions in which they occur on the contiguous sequence.

Comparative Genomics is a feature of the database system of the present invention which allows a user to compare the sequence data of sets of different organism types. Comparative searches may be formulated in a number of ways using the Comparative Genomics feature. For example, genes common to a set of organisms may be identified through a “commonality” query, and genes unique to one of a set of organisms may be identified through a “subtraction” query.
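The two query types may be sketched with ordinary set operations. This sketch assumes that homologous genes in different organisms have already been mapped to a shared identifier (a step that in practice requires sequence comparison); the function names, organism labels, and gene names are illustrative only.

```python
def commonality(libraries):
    """Genes present in every selected library."""
    sets = [set(genes) for genes in libraries.values()]
    return set.intersection(*sets) if sets else set()


def subtraction(libraries, target):
    """Genes present in the target library and in none of the others."""
    others = [set(g) for name, g in libraries.items() if name != target]
    return set(libraries[target]).difference(*others)


# Hypothetical gene content, keyed by organism, using shared identifiers.
libraries = {
    "org_a": {"dnaA", "recA", "toxX"},
    "org_b": {"dnaA", "recA"},
    "org_c": {"dnaA", "gyrB"},
}

print(sorted(commonality(libraries)))
print(sorted(subtraction(libraries, "org_a")))
```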

Electronic Southern is a feature of the present database system which is useful for identifying genomic libraries in which a given gene or ORF exists.

A Southern analysis is a conventional molecular biology technique in which a nucleic acid of known sequence is used to identify matching (complementary) sequences in a sample of nucleic acid to be analyzed. Like their laboratory counterparts, Electronic Southerns according to the present invention may be used to locate homologous matches between a “probe” DNA sequence and a large number of DNA sequences in one or more libraries.
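A toy version of such a probe search is sketched below. It scores each library sequence by the fraction of the probe's k-mers that it shares, which is a deliberately simple stand-in for a real homology search and is not the matching method of the specification. The function names, the parameters `k` and `min_shared`, and the sequences are assumptions made for illustration.

```python
def kmers(seq, k):
    """Set of all length-k substrings of seq (empty if seq is shorter than k)."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def electronic_southern(probe, libraries, k=4, min_shared=0.5):
    """Return (library, sequence id, score) for sequences sharing at least
    `min_shared` of the probe's k-mers, best matches first."""
    probe_kmers = kmers(probe, k)
    if not probe_kmers:
        raise ValueError("probe shorter than k")
    hits = []
    for lib_name, seqs in libraries.items():
        for seq_id, seq in seqs.items():
            shared = len(probe_kmers & kmers(seq, k)) / len(probe_kmers)
            if shared >= min_shared:
                hits.append((lib_name, seq_id, shared))
    return sorted(hits, key=lambda h: h[2], reverse=True)


# Hypothetical libraries: library name -> {sequence id: sequence}.
southern_libraries = {
    "lib1": {"s1": "TTACGTACGATT", "s2": "GGGGGGGG"},
    "lib2": {"s3": "ACGTTTTT"},
}
hits = electronic_southern("ACGTACGA", southern_libraries)
print(hits)
```

The list of libraries in which the probe is found is then simply the set of library names appearing in the returned hits.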

The present invention provides a method of comparing genetic complements of different types of organisms. The method involves providing a database having sequence libraries with multiple biomolecular sequences for different types of organisms, where at least some of the sequences represent open reading frames located along one or more contiguous sequences on each of the organisms' genomes. The method further involves receiving a selection of two or more of the sequence libraries for comparison, determining open reading frames common or unique to the selected sequence libraries, and displaying the results of the determination.

The invention also provides a method of comparing genomic complements of different types of organisms. The method involves providing a database having genomic sequence libraries with multiple biomolecular sequences for different types of organisms, where at least some of the sequences represent open reading frames located along one or more contiguous sequences on each of the organisms' genomes. The method further involves receiving a selection of two or more of the sequence libraries for comparison, determining sequences common or unique to the selected sequence libraries, and displaying the results of the determination.

The invention further provides a computer system including a database containing genomic libraries for different types of organisms, which libraries have multiple genomic sequences, at least some of which represent open reading frames located along one or more contiguous sequences on each of the organisms' genomes. The system also includes a user interface capable of receiving a selection of two or more genomic libraries for comparison and displaying the results of the comparison.

Another aspect of the present invention provides a method of identifying libraries in which a given gene exists. The method involves providing a database including genomic libraries for one or more types of organisms. The libraries have multiple genomic sequences, at least some of which represent open reading frames located along one or more contiguous sequences on each of the organisms' genomes. The method further involves receiving a selection of one or more probe sequences, determining homologous matches between the selected probe sequences and the sequences in the genomic libraries, and displaying the results of the determination.

The invention also provides a computer system including a database including genomic libraries for one or more types of organisms, which libraries have multiple genomic sequences, at least some of which represent open reading frames located along one or more contiguous sequences on each of the organisms' genomes. The system also includes a user interface capable of receiving a selection of one or more probe sequences for use in determining homologous matches between one or more probe sequences and the sequences in the genomic libraries, and displaying the results of the determination.

Also provided is a computer program product including a computer-usable medium having computer-readable program code embodied thereon relating to a database including genomic libraries for one or more types of organisms. The libraries have multiple genomic sequences, at least some of which represent open reading frames located along one or more contiguous sequences on each of the organisms' genomes. The computer program product includes computer-readable program code for providing, within a computing system, an interface for receiving a selection of two or more genomic libraries for comparison, determining sequences common or unique to the selected genomic libraries, and displaying the results of the determination.

Additionally provided is a computer program product including a computer-usable medium having computer-readable program code embodied thereon relating to a database including genomic libraries for one or more types of organisms. The libraries have multiple genomic sequences, at least some of which represent open reading frames located along one or more contiguous sequences on each of the organisms' genomes. The computer program product includes computer-readable program code for providing, within a computing system, an interface for receiving a selection of one or more probe open reading frames, determining homologous matches between the probe sequences and the sequences in the genomic libraries, and displaying the results of the determination.

The invention further provides a method of presenting the genetic complement of an organism. The method involves providing a database including sequence libraries for a plurality of types of organisms, where the libraries have multiple biomolecular sequences, at least some of which represent open reading frames located along one or more contiguous sequences on each of the organisms' genomes. The method further involves receiving a selection of one of the sequence libraries, determining open reading frames within the selected sequence library, and displaying the results as one or more unique identifiers for groups of related open reading frames.

The present invention provides relational database systems for storing biomolecular sequence information in a manner that allows sequences to be catalogued and searched according to one or more protein function hierarchies. The hierarchies are provided to allow carefully tailored searches for sequences based upon a protein's biological function or molecular function. To make this capability available in large sequence databases, the invention provides a mechanism for automatically grouping new sequences into protein function hierarchies. This mechanism takes advantage of descriptive information obtained from “external hits” which are matches of stored sequences against gene sequences stored in an external database such as GenBank. The descriptive information provided with GenBank is evaluated according to a specific algorithm and used to automatically group the external hits (or the sequences associated with the hits) in the categories. Ultimately, the biomolecular sequences stored in databases of this invention are provided with both descriptive information from the external hit and category information from a relevant hierarchy or hierarchies.

The invention provides a computer system having a database containing records pertaining to a plurality of biomolecular sequences. At least some of the biomolecular sequences are grouped into a first hierarchy of protein function categories, the protein function categories specifying biological functions of proteins corresponding to the biomolecular sequences. The first hierarchy includes a first set of protein function categories specifying biological functions at a cellular level, and a second set of protein function categories specifying biological functions at a level above the cellular level. The computer system of the invention also includes a user interface allowing a user to selectively view information regarding the plurality of biomolecular sequences as it relates to the first hierarchy. The computer system may also include additional protein function categories based, for example, on molecular or enzymatic function of proteins. The biomolecular sequences may include nucleic acid or amino acid sequences. Some of said biomolecular sequences may be provided as part of one or more projects for obtaining full-length gene sequences from shorter sequences, and the database records may contain information about such projects.

The invention also provides a method of using a computer system to present information pertaining to a plurality of biomolecular sequence records stored in a database. The method involves displaying a list of the records or a field for entering information identifying one or more of the records, identifying one or more of the records that a user has selected from the list or field, matching the one or more selected records with one or more protein function categories from a first hierarchy of protein function categories into which at least some of the biomolecular sequence records are grouped, and displaying the one or more categories matching the one or more selected records. The protein function categories specify biological functions of proteins corresponding to the biomolecular sequences and the first hierarchy includes a first set of protein function categories specifying biological functions at a cellular level, and a second set of protein function categories specifying biological functions at a tissue level. The method may also involve matching the records against other protein function hierarchies, such as hierarchies based on molecular and/or enzymatic function, and displaying the results. At least some of the biomolecular sequences may be provided as part of one or more projects for obtaining full-length gene sequences from shorter sequences, and the database records may contain information about those projects.

Additionally, the invention provides a method of using a computer system to present information pertaining to a plurality of biomolecular sequence records stored in a database. The method involves displaying a list of one or more protein biological function categories from a first hierarchy of protein biological function categories into which at least some of the biomolecular sequence records are grouped, identifying one or more of the protein biological function categories that a user has selected from the list, matching the one or more selected protein biological function categories with one or more biomolecular sequence records which are grouped in the selected protein biological function categories, and displaying the one or more sequence records matching the one or more selected protein biological function categories. The protein biological function categories specify biological functions of proteins corresponding to the biomolecular sequences and the first hierarchy includes a first set of protein biological function categories specifying biological functions at a cellular level, and a second set of protein biological function categories specifying biological functions at a tissue level. The method may also involve matching the records against other protein function hierarchies, such as hierarchies based on molecular and/or enzymatic function, and displaying the results. At least some of the biomolecular sequences may be provided as part of one or more projects for obtaining full-length gene sequences from shorter sequences, and the database records may contain information about those projects.

Another aspect of the invention provides a database system having a plurality of internal records. The database includes a plurality of sequence records specifying biomolecular sequences, at least some of which records reference hits to an external database, which hits specify genes having sequences that at least partially match those of the biomolecular sequences. The database also includes a plurality of external hit records specifying the hits to the external database, and at least some of the records reference protein function hierarchy categories which specify at least one of biological functions of proteins or molecular functions of proteins. At least some of the biomolecular sequences may be provided as part of one or more projects for obtaining full-length gene sequences from shorter sequences, and the database records may contain information about those projects.

Further aspects of the present invention provide a method of using a computer system and a computer readable medium having program instructions to automatically categorize biomolecular sequence records into protein function categories in an internal database. The method and program involve receiving descriptive information about a biomolecular sequence in the internal database from a record in an external database pertaining to a gene having a sequence that at least partially matches that of the biomolecular sequence. Next, a determination is made whether the descriptive information contains one or more terms matching one or more keywords associated with a first protein function category, the keywords being terms consistent with a classification in the first protein function category. When at least one keyword is found to match a term in the descriptive information, a determination is made whether the descriptive information contains a term matching one or more anti-keywords associated with the first protein function category, the anti-keywords being terms inconsistent with a classification in the first protein function category. Then, the biomolecular sequence is grouped in the first protein function category when the descriptive information contains a term matching a keyword but contains no term matching an anti-keyword.
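The keyword and anti-keyword test described above can be sketched in a few lines. The sketch assumes that the descriptive information is free text split on whitespace and that each category carries one set of keywords and one set of anti-keywords; the function name, the rule layout, and the example terms are hypothetical.

```python
def categorize(description, categories):
    """Return the categories whose keywords match the description and
    whose anti-keywords do not."""
    terms = set(description.lower().split())
    assigned = []
    for name, rule in categories.items():
        has_keyword = bool(terms & rule["keywords"])
        has_anti_keyword = bool(terms & rule["anti_keywords"])
        # Group the sequence only when a keyword matches and no
        # anti-keyword matches.
        if has_keyword and not has_anti_keyword:
            assigned.append(name)
    return assigned


# Hypothetical rule: "kinase" suggests the category unless the hit is
# described as an inhibitor or substrate of a kinase.
categories = {
    "kinase": {
        "keywords": {"kinase"},
        "anti_keywords": {"inhibitor", "substrate"},
    },
}

print(categorize("serine/threonine protein kinase", categories))
print(categorize("cyclin-dependent kinase inhibitor", categories))
```

The anti-keyword check is what prevents a description such as "kinase inhibitor" from being grouped with the kinases merely because it contains the keyword.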

The present invention provides relational database systems for storing biomolecular sequence information in a manner that allows sequences to be catalogued and searched according to one or more characteristics. The sequence information of the database is generated by one or more “projects” which are concerned with identifying the full-length coding sequence of a gene (i.e., mRNA). The projects involve the extension of an initial sequenced portion of a clone of a gene of interest (e.g., an EST) by a variety of methods which use conventional molecular biological techniques, recently developed adaptations of these techniques, and certain novel database applications. Data accumulated in these projects may be provided to the database of the present invention throughout the course of the projects and may be available to database users (subscribers) throughout the course of these projects for research, product (i.e., drug) development, and other purposes.

In a preferred embodiment, the database of the present invention and its associated projects may provide sequence and related data in amounts and forms not previously available. The present invention preferably makes partial and full-length sequence information for a given gene available to a user both during the course of the data acquisition and once the full-length sequence of the gene has been elucidated. The database also preferably provides a variety of tools for analysis and manipulation of the data, including Northern analysis and Expression summaries. The present invention should permit more complete and accurate annotation of sequence data, as well as the study of relationships between genes of different tissues, systems or organisms, and ultimately detailed expression studies of full-length gene sequences.

The invention provides a computer system including a database having sequence records containing information identifying one or more projects to which each of the sequence records belongs. Each project groups together one or more biomolecular sequences generated during work to obtain a full-length gene sequence from a shorter sequence. The computer system also has a user interface allowing a user to selectively view information regarding one or more projects. The biomolecular sequences may include nucleic acid or amino acid sequences. The user interface may allow users to view at least three levels of project information including a project information results level listing at least some of the projects in said database, a sequence information results level listing at least some of the sequences associated with a given project, and a sequence retrieval results level sequentially listing monomers which comprise a given sequence.

A method of using a computer system and a computer program product to present information pertaining to a plurality of sequence records stored in a database are also provided by the present invention. The sequence records contain information identifying one or more projects to which each of the sequence records belongs. Each of the projects groups one or more biomolecular sequences generated during work to obtain a full-length gene sequence from a shorter sequence. The method and program involve providing an interface for entering query information relating to one or more projects, locating data corresponding to the entered query information, and displaying the data corresponding to the entered query information.
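One possible relational layout for project-based sequence records is sketched below using an in-memory SQLite database. The table names, column names, and sample rows are assumptions made for illustration and do not reproduce the schema of the specification.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE project (
        project_id INTEGER PRIMARY KEY,
        name       TEXT NOT NULL
    );
    CREATE TABLE sequence_record (
        seq_id     INTEGER PRIMARY KEY,
        project_id INTEGER NOT NULL REFERENCES project(project_id),
        label      TEXT NOT NULL,
        residues   TEXT NOT NULL
    );
    """
)

# One hypothetical project: an EST seed and one extension of it.
conn.execute("INSERT INTO project VALUES (1, 'full_length_gene_X')")
conn.executemany(
    "INSERT INTO sequence_record VALUES (?, ?, ?, ?)",
    [(1, 1, "EST_seed", "ACGT"), (2, 1, "extension_1", "ACGTACGT")],
)


def sequences_for_project(conn, project_name):
    """Locate the sequence records that belong to the named project."""
    rows = conn.execute(
        """
        SELECT s.label, s.residues
        FROM sequence_record AS s
        JOIN project AS p ON p.project_id = s.project_id
        WHERE p.name = ?
        ORDER BY s.seq_id
        """,
        (project_name,),
    )
    return rows.fetchall()


for label, residues in sequences_for_project(conn, "full_length_gene_X"):
    print(label, residues)
```

The query corresponds to the "entering query information, locating data, and displaying the data" sequence above; the three result levels (projects, sequences of a project, monomers of a sequence) would each be a further query of the same kind.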

Additionally, the invention provides a method of using a computer system to present information pertaining to a plurality of sequence records stored in a database. The sequence records contain information identifying one or more projects to which each of the sequence records belongs. Each of the projects groups one or more biomolecular sequences generated during work to obtain a full-length gene sequence from a shorter sequence. The method involves displaying a list of one or more project identifiers, determining which project identifier or identifiers from the list is selected by a user, then displaying a second list of one or more biomolecular sequence identifiers associated with the selected project identifier or identifiers, determining which sequence identifier or identifiers from the second list has been selected by a user, and displaying a third list of one or more sequences corresponding to the selected sequence identifier or identifiers. Following the display of the third list, a determination may be made whether and which sequence from the third list has been selected by a user. If a sequence is selected, a sequence alignment search of the selected sequence against other databased sequences may be initiated, and the results of the alignment search displayed.

For Electronic Northern analysis, the invention further provides acomputer system including a database having sequence records containinginformation identifying one or more projects to which each of thesequence records belong, each of said projects grouping one or morebiomolecular sequences generated during work to obtain a full-lengthgene sequence from a shorter sequence. The system also has a userinterface capable of allowing a user to select one or more projectidentifiers or project member identifiers specifying one or moresequences to be compared with one or more cDNA sequence libraries, anddisplaying matches resulting from that comparison.

A method of using a computer system to present comparative informationpertaining to a plurality of sequence records stored in a database isalso provided by the present invention. The sequence records containinformation identifying one or more projects to which each of thesequence records belong, each of the projects grouping one or morebiomolecular sequences generated during work to obtain a full-lengthgene sequence from a shorter sequence. The method involves providing aninterface capable of allowing a user to select one or more projectidentifiers or project member identifiers specifying one or moresequences, comparing the one or more specified sequences with one ormore cDNA sequence libraries, and displaying matches resulting from thecomparison.

In addition, for Expression analysis, the invention provides a computer system including a database having sequence records containing information identifying one or more projects to which each of the sequence records belongs, each of the projects grouping one or more biomolecular sequences generated during work to obtain a full-length gene sequence from a shorter sequence. The system also has a user interface allowing a user to view expression information pertaining to the projects by selecting one or more expression categories for a query, and displaying the result of the query.

A method of using a computer system to view expression information pertaining to one or more projects, each of the projects grouping one or more biomolecular sequences generated during work to obtain a full-length gene sequence from a shorter sequence, is also provided in accordance with the present invention. The computer system includes a database storing a plurality of sequence records, the sequence records containing information identifying one or more projects to which each of the sequence records belongs. The method involves providing an interface which allows a user to select one or more expression categories as a query, locating projects belonging to the selected one or more expression categories, and displaying a list of located projects.

Finally, the present invention provides a computer system including a database having sequence records containing information identifying one or more projects to which each of the sequence records belongs, each of the projects grouping one or more biomolecular sequences generated during work to obtain a full-length gene sequence from a shorter sequence. This computer system has a user interface allowing a user to selectively view information regarding said one or more projects and which displays information to a user in a format common to one or more other sequence databases. These and other features and advantages of the invention will be described in more detail below with reference to the drawings.

Polymer sequences are assembled into bins. A first number of bins are populated with polymer sequences. The polymer sequences in each bin are assembled into one or more consensus sequences representative of the polymer sequences of the bin. The consensus sequences of the bins are compared to determine relationships, if any, between the consensus sequences. The bins are modified based on the relationships between the consensus sequences. The polymer sequences are reassembled in the modified bins to generate one or more modified consensus sequences for each bin representative of the modified bins.
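
As a minimal sketch of the binning cycle just described, the following assumes that the sequences in every bin are already aligned to a common length, that a consensus is a column-wise majority vote, and that two bins are "related" when their consensus sequences exceed an identity threshold. The function names and the 0.9 threshold are illustrative assumptions, not the assembly method of the invention.

```python
from collections import Counter

def consensus(seqs):
    """Column-wise majority vote over equal-length, pre-aligned sequences."""
    return "".join(Counter(col).most_common(1)[0][0] for col in zip(*seqs))

def identity(a, b):
    """Fraction of positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def merge_similar_bins(bins, threshold=0.9):
    """Compare bin consensus sequences, merge related bins, and reassemble.

    Returns the modified bins and one modified consensus per bin.
    """
    merged = []
    for seqs in bins:
        cons = consensus(seqs)
        for target in merged:
            if identity(consensus(target), cons) >= threshold:
                target.extend(seqs)      # related bins are combined
                break
        else:
            merged.append(list(seqs))    # unrelated bin is kept separate
    return merged, [consensus(b) for b in merged]

bins = [
    ["ACGTACGTAC", "ACGTACGTAC", "ACGTACGAAC"],
    ["ACGTACGTAC", "ACGTACGTAC"],
    ["TTTTGGGGCC", "TTTTGGGGCC"],
]
new_bins, new_consensus = merge_similar_bins(bins)
print(len(new_bins), new_consensus)
```

Here the first two bins share a consensus and are merged, while the third remains separate. A real assembler would work from unaligned reads and gapped alignments, which this sketch deliberately leaves out.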

In another aspect of the invention, sequence similarities and dissimilarities are analyzed in a set of polymer sequences. Pairwise alignment data is generated for pairs of the polymer sequences. The pairwise alignment data defines regions of similarity between the pairs of polymer sequences with boundaries. Additional boundaries in particular polymer sequences are determined by applying at least one boundary from at least one pairwise alignment for one pair of polymer sequences to at least one other pairwise alignment for another pair of polymer sequences including one of the particular polymer sequences. Additional regions of similarity are generated based on the boundaries.
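
One way to picture the boundary transfer described above is the following sketch. It assumes ungapped regions of similarity with half-open coordinates, so that a position on one sequence maps to its partner by a constant offset; the `Region` type, the sequence names X, Y, and Z, and the function names are hypothetical.

```python
from typing import NamedTuple

class Region(NamedTuple):
    """An ungapped region of similarity between sequences a and b."""
    a: str
    a_start: int
    a_end: int
    b: str
    b_start: int
    b_end: int

def boundaries(seq_id, regions):
    """Collect every region boundary that falls on the given sequence."""
    points = set()
    for r in regions:
        if r.a == seq_id:
            points.update((r.a_start, r.a_end))
        if r.b == seq_id:
            points.update((r.b_start, r.b_end))
    return sorted(points)

def split_on(region, points):
    """Split a region at the points lying strictly inside its span on region.a.

    Each cut is mapped onto the partner sequence by a constant offset, which
    is valid only because the region is assumed to be ungapped.
    """
    cuts = [p for p in points if region.a_start < p < region.a_end]
    edges = [region.a_start, *cuts, region.a_end]
    offset = region.b_start - region.a_start
    return [Region(region.a, s, e, region.b, s + offset, e + offset)
            for s, e in zip(edges, edges[1:])]

regions = [
    Region("X", 0, 100, "Y", 10, 110),   # X versus Y
    Region("X", 40, 80, "Z", 0, 40),     # X versus Z
]
points_on_x = boundaries("X", regions)
sub_regions = split_on(regions[0], points_on_x)
print(points_on_x)
print(sub_regions)
```

The boundaries at 40 and 80 come from the X/Z alignment, yet they subdivide the X/Y alignment into three additional regions, which is the kind of transfer the paragraph describes.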

1.2.1.33. Annotating—Preferred Embodiments

Generally, the present invention provides an improved relational database for storing and manipulating genomic sequence information. While the invention is described in terms of a database optimized for microbial data, it is by no means so limited. The invention may be employed to investigate data from various sources. For example, the invention covers databases optimized for other sources of sequence data, such as animal sequences (e.g., human, primate, rodent, amphibian, insect, etc.), plant sequences and microbial sequences. In the following description, numerous specific details are set forth in order to provide a thorough understanding of the present invention. It will be apparent, however, that the present invention may be practiced without limitation to some of the specific details presented herein.

Generally, the present invention provides an improved relational database for storing sequence information. The invention may be employed to investigate data from various sources. For example, it may catalogue animal sequences (e.g., human, primate, rodent, amphibian, insect, etc.), plant sequences, and microbial sequences.

1.3. Transcriptome Analysis or RNA Profiling

The characterization of RNA expression and transcript populations (the transcriptome) can be referred to as RNA profiling and/or expression profiling, utilizing high throughput techniques such as RNA differential displays and DNA microarrays. One potential method to characterize gene expression, SAGE (Serial Analysis of Gene Expression), utilizes combinatorial chemistry technology and short sequence tags in the screening of compound libraries. For further information see references: Burge, C. B. 2001. Chipping away at the transcriptome. Nat Genet, 27(3): 232-4; Hughes, T. R. and Shoemaker, D. D. 2001. DNA microarrays for expression profiling. Curr Opin Chem Biol, 5(1): 21-5; Yamamoto, M. et al. 2001. Use of serial analysis of gene expression (SAGE) technology. J Immunol Methods, 250(1-2): 45-66.

1.3.1 Screening and Selecting Nucleotides for Protein Binding

An embodiment of this invention provides for screening methods that include the use of recombinant and in vitro chemical synthesis methods. In these hybrid methods, cell-free enzymatic machinery is employed to accomplish the in vitro synthesis of the library members (i.e., peptides or polynucleotides). In one type of method, RNA molecules with the ability to bind a predetermined protein or a predetermined dye molecule were selected by alternate rounds of selection and PCR amplification (Tuerk and Gold, 1990; Ellington and Szostak, 1990). A similar technique was used to identify DNA sequences which bind a predetermined human transcription factor (Thiesen and Bach, 1990; Beaudry and Joyce, 1992; PCT patent publications WO 92/05258 and WO 92/14843).

1.4. Proteomics

In another embodiment, this invention relates to the emerging field of proteomics. Proteomics involves the qualitative and quantitative measurement of gene activity by detecting and quantitating expression at the protein level, rather than at the messenger RNA level. Proteomics also involves the study of non-genome encoded events, including the post-translational modification of proteins (including glycosylation or other modifications), interactions between proteins, and the location of proteins within a cell. The structure, function, and/or level of activity of the proteins expressed by the cell are also of interest. Essentially, proteomics involves the study of part or all of the status of the total protein contained within or secreted by a cell. Proteomics requires means of separating proteins in complex mixtures and identifying both low- and high-abundance species. Examples of powerful methods currently used to resolve complex protein mixtures are 2D gel electrophoresis, reverse phase HPLC, capillary electrophoresis, isoelectric focusing and related hybrid techniques. Commonly used protein identification techniques include N-terminal Edman sequencing and mass spectrometry (electrospray ionization [ESI] or matrix-assisted laser desorption ionization [MALDI] MS), together with sophisticated database search programs, such as SEQUEST, which identify proteins in World Wide Web protein and nucleic acid databases from the MS-MS spectra of their peptides. Using a computer, the output of the mass spectrometry can be analyzed so as to link a gene and the particular protein for which it codes. This overall process is sometimes referred to as “functional genomics”.

For general information on proteome research, see, for example, J. S. Fruton, 1999, Proteins, Enzymes, Genes: The Interplay of Chemistry and Biology, Yale Univ. Pr.; Wilkins et al., 1997, Proteome Research: New Frontiers in Functional Genomics (Principles and Practice), Springer Verlag; A. J. Link, 1999, 2-D Proteome Analysis Protocols (Methods in Molecular Biology, 112, Humana Pr.); and Kamp et al., 1999, Proteome and Protein Analysis, Springer Verlag. See also James, Peter, “Protein identification in the post-genome era: the rapid rise of proteomics”, Q. Rev. Biophysics, Vol. 30, No. 4, pp. 279-331 (1997), which is incorporated herein by reference.

1.4.1 Screening Peptides: Peptide Display Methods

The present invention is further directed to a method for generating a selected mutant polynucleotide sequence (or a population of selected polynucleotide sequences), typically in the form of amplified and/or cloned polynucleotides, whereby the selected polynucleotide sequence(s) possess at least one desired phenotypic characteristic (e.g., encodes a polypeptide, promotes transcription of linked polynucleotides, binds a protein, and the like) which can be selected for. One method for identifying hybrid polypeptides that possess a desired structure or functional property, such as binding to a predetermined biological macromolecule (e.g., a receptor), involves the screening of a large library of polypeptides for individual library members which possess the desired structure or functional property conferred by the amino acid sequence of the polypeptide.

One method of screening peptides involves the display of a peptide sequence, antibody, or other protein on the surface of a bacteriophage particle or cell. Generally, in these methods each bacteriophage particle or cell serves as an individual library member displaying a single species of displayed peptide in addition to the natural bacteriophage or cell protein sequences. Each bacteriophage or cell contains the nucleotide sequence information encoding the particular displayed peptide sequence; thus, the displayed peptide sequence can be ascertained by nucleotide sequence determination of an isolated library member.

A well-known peptide display method involves the presentation of a peptide sequence on the surface of a filamentous bacteriophage, typically as a fusion with a bacteriophage coat protein. The bacteriophage library can be incubated with an immobilized, predetermined macromolecule or small molecule (e.g., a receptor) so that bacteriophage particles which present a peptide sequence that binds to the immobilized macromolecule can be differentially partitioned from those that do not present peptide sequences that bind to the predetermined macromolecule. The bacteriophage particles (i.e., library members) which are bound to the immobilized macromolecule are then recovered and replicated to amplify the selected bacteriophage sub-population for a subsequent round of affinity enrichment and phage replication. After several rounds of affinity enrichment and phage replication, the bacteriophage library members that are thus selected are isolated and the nucleotide sequence encoding the displayed peptide sequence is determined, thereby identifying the sequence(s) of peptides that bind to the predetermined macromolecule (e.g., receptor). Such methods are further described in PCT patent publications WO 91/17271, WO 91/18980, WO 91/19818 and WO 93/08278.
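
The effect of repeating the bind, recover, and replicate cycle can be illustrated with a toy model. It assumes that each clone is retained on the immobilized target with a fixed per-round probability and that replication restores the pool without bias; the clone names, the retention values, and the starting frequencies are hypothetical numbers chosen only to show the trend.

```python
def enrich(frequencies, retention, rounds):
    """Return clone frequencies after the given number of enrichment rounds.

    frequencies: clone -> starting fraction of the library
    retention:   clone -> assumed probability of being recovered per round
    """
    freqs = dict(frequencies)
    for _ in range(rounds):
        # Affinity step: each clone is kept in proportion to its retention.
        kept = {clone: f * retention[clone] for clone, f in freqs.items()}
        # Replication step: amplify the recovered pool back to a full library.
        total = sum(kept.values())
        freqs = {clone: k / total for clone, k in kept.items()}
    return freqs

start = {"binder": 1e-6, "background": 1 - 1e-6}
retention = {"binder": 0.5, "background": 0.001}

for n in (1, 2, 3):
    print(n, enrich(start, retention, n)["binder"])
```

Under these assumed values a binder present at one part per million remains rare after a single round but dominates the pool after three, which is why several rounds are typically run before sequencing.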

The latter PCT publication describes a recombinant DNA method for the display of peptide ligands that involves the production of a library of fusion proteins with each fusion protein composed of a first polypeptide portion, typically comprising a variable sequence, that is available for potential binding to a predetermined macromolecule, and a second polypeptide portion that binds to DNA, such as the DNA vector encoding the individual fusion protein. When transformed host cells are cultured under conditions that allow for expression of the fusion protein, the fusion protein binds to the DNA vector encoding it. Upon lysis of the host cell, the fusion protein/vector DNA complexes can be screened against a predetermined macromolecule in much the same way as bacteriophage particles are screened in the phage-based display system, with the replication and sequencing of the DNA vectors in the selected fusion protein/vector DNA complexes serving as the basis for identification of the selected library peptide sequence(s).

The displayed peptide sequences can be of varying lengths, typically from 3-5000 amino acids long or longer, frequently from 5-100 amino acids long, and often from about 8-15 amino acids long. A library can comprise library members having varying lengths of displayed peptide sequence, or may comprise library members having a fixed length of displayed peptide sequence. Portions or all of the displayed peptide sequence(s) can be random, pseudorandom, defined set kernel, fixed, or the like. The present display methods include methods for in vitro and in vivo display of single-chain antibodies, such as nascent scFv on polysomes or scFv displayed on phage, which enable large-scale screening of scFv libraries having broad diversity of variable region sequences and binding specificities.
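
To give a sense of scale for fully random displayed peptides of the lengths mentioned above, the following sketch computes the number of possible sequences and the fraction of them a library of a given size would be expected to contain. It assumes 20 amino acids per position and that each library member is an independent, uniform draw (a Poisson approximation); the function names and the example library size of 10^9 are illustrative assumptions.

```python
import math

AMINO_ACIDS = 20  # the standard genetically encoded amino acids

def theoretical_diversity(length):
    """Number of distinct fully random peptides of the given length."""
    return AMINO_ACIDS ** length

def expected_coverage(library_size, length):
    """Expected fraction of all possible peptides present at least once.

    Assumes independent, uniform sampling (Poisson approximation).
    """
    return 1.0 - math.exp(-library_size / theoretical_diversity(length))

print(theoretical_diversity(8))
print(expected_coverage(1e9, 8))
```

An 8-mer has 20^8 = 2.56 x 10^10 possible sequences, so under these assumptions a library of 10^9 members samples only a few percent of them, and the gap widens rapidly with length.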

The present invention also provides random, pseudorandom, and defined sequence framework peptide libraries and methods for generating and screening those libraries to identify useful compounds (e.g., peptides, including single-chain antibodies) that bind to receptor molecules or epitopes of interest or gene products that modify peptides or RNA in a desired fashion. The random, pseudorandom, and defined sequence framework peptides are produced from libraries of peptide library members that comprise displayed peptides or displayed single-chain antibodies attached to a polynucleotide template from which the displayed peptide was synthesized. The mode of attachment may vary according to the specific embodiment of the invention selected, and can include encapsulation in a phage particle or incorporation in a cell.

1.4.2. Screening that Utilizes In Vitro Translation Systems

An embodiment of this invention provides for the use of in vitro translation during the step of screening. In vitro translation has been used to synthesize proteins of interest and has been proposed as a method for generating large libraries of peptides. These methods, generally comprising stabilized polysome complexes, are described further in WO 91/05058 and WO 92/02536. Applicants have described methods in which library members comprise a fusion protein having a first polypeptide portion with DNA binding activity and a second polypeptide portion having the library member unique peptide sequence; such methods are suitable for use in cell-free in vitro selection formats, among others.

1.4.3. Affinity Enrichment

An aspect of this invention provides for the use of affinity enrichment which allows a very large library of peptides and single-chain antibodies to be screened and the polynucleotide sequence encoding the desired peptide(s) or single-chain antibodies to be selected. The polynucleotide can then be isolated and shuffled to recombine combinatorially the amino acid sequence of the selected peptide(s) (or predetermined portions thereof) or single-chain antibodies (or just VH, VL or CDR portions thereof). Using these methods, one can identify a peptide or single-chain antibody as having a desired binding affinity for a molecule and can exploit the process of shuffling to converge rapidly to a desired high-affinity peptide or scFv. The peptide or antibody can then be synthesized in bulk by conventional means for any suitable use (e.g., as a therapeutic or diagnostic agent).

A significant advantage of the present invention is that no prior information regarding an expected ligand structure is required to isolate peptide ligands or antibodies of interest. The peptide identified can have biological activity, which is meant to include at least specific binding affinity for a selected receptor molecule and, in some instances, will further include the ability to block the binding of other compounds, to stimulate or inhibit metabolic pathways, to act as a signal or messenger, to stimulate or inhibit cellular activity, and the like.

The present invention also provides a method for shuffling a pool of polynucleotide sequences selected by affinity screening a library of polysomes displaying nascent peptides (including single-chain antibodies) for library members which bind to a predetermined receptor (e.g., a mammalian proteinaceous receptor such as, for example, a peptidergic hormone receptor, a cell surface receptor, an intracellular protein which binds to other protein(s) to form intracellular protein complexes such as hetero-dimers and the like) or epitope (e.g., an immobilized protein, glycoprotein, oligosaccharide, and the like).

The invention also provides peptide libraries comprising a plurality of individual library members of the invention, wherein (1) each individual library member of said plurality comprises a sequence produced by shuffling of a pool of selected sequences, and (2) each individual library member comprises a variable peptide segment sequence or single-chain antibody segment sequence which is distinct from the variable peptide segment sequences or single-chain antibody sequences of other individual library members in said plurality (although some library members may be present in more than one copy per library due to uneven amplification, stochastic probability, or the like).

1.4.4. Antibody Display

The present method can be used to shuffle, by in vitro and/or in vivo recombination by any of the disclosed methods, and in any combination, polynucleotide sequences selected by antibody display methods, wherein an associated polynucleotide encodes a displayed antibody which is screened for a phenotype (e.g., for affinity for binding a predetermined antigen (ligand)).

Various prokaryotic expression systems have been developed that can be manipulated to produce combinatorial antibody libraries which may be screened for high-affinity antibodies to specific antigens. Recent advances in the expression of antibodies in Escherichia coli and bacteriophage systems (see “alternative peptide display methods”, infra) have raised the possibility that virtually any specificity can be obtained by either cloning antibody genes from characterized hybridomas or by de novo selection using antibody gene libraries (e.g., from Ig cDNA).

Combinatorial libraries of antibodies have been generated in bacteriophage lambda expression systems which may be screened as bacteriophage plaques or as colonies of lysogens (Huse et al, 1989; Caton and Koprowski, 1990; Mullinax et al, 1990; Persson et al, 1991). Various embodiments of bacteriophage antibody display libraries and lambda phage expression libraries have been described (Kang et al, 1991; Clackson et al, 1991; McCafferty et al, 1990; Burton et al, 1991; Hoogenboom et al, 1991; Chang et al, 1991; Breitling et al, 1991; Marks et al, 1991, p. 581; Barbas et al, 1992; Hawkins and Winter, 1992; Marks et al, 1992, p. 779; Marks et al, 1992, p. 16007; Lowman et al, 1991; and Lerner et al, 1992; all incorporated herein by reference). Typically, a bacteriophage antibody display library is screened with a receptor (e.g., polypeptide, carbohydrate, glycoprotein, nucleic acid) that is immobilized (e.g., by covalent linkage to a chromatography resin to enrich for reactive phage by affinity chromatography) and/or labeled (e.g., to screen plaque or colony lifts).

One particularly advantageous approach has been the use of so-called single-chain fragment variable (scFv) libraries (Marks et al, 1992, p. 779; Winter and Milstein, 1991; Clackson et al, 1991; Marks et al, 1991, p. 581; Chaudhary et al, 1990; Chiswell et al, 1992; McCafferty et al, 1990; and Huston et al, 1988). Various embodiments of scFv libraries displayed on bacteriophage coat proteins have been described. Bacteriophage display of scFv has already yielded a variety of useful antibodies and antibody fusion proteins. A bispecific single chain antibody has been shown to mediate efficient tumor cell lysis (Gruber et al, 1994). Intracellular expression of an anti-Rev scFv has been shown to inhibit HIV-1 virus replication in vitro (Duan et al, 1994), and intracellular expression of an anti-p21ras scFv has been shown to inhibit meiotic maturation of Xenopus oocytes (Biocca et al, 1993). Recombinant scFv which can be used to diagnose HIV infection have also been reported, demonstrating the diagnostic utility of scFv (Lilley et al, 1994). Fusion proteins wherein an scFv is linked to a second polypeptide, such as a toxin or fibrinolytic activator protein, have also been reported (Holvoet et al, 1992; Nicholls et al, 1993).

Various methods have been reported for increasing the combinatorial diversity of a scFv library to broaden the repertoire of binding species (idiotype spectrum). Enzymatic inverse PCR mutagenesis has been shown to be a simple and reliable method for constructing relatively large libraries of scFv site-directed hybrids (Stemmer et al, 1993), as has error-prone PCR and chemical mutagenesis (Deng et al, 1994). Riechmann (Riechmann et al, 1993) showed semi-rational design of an antibody scFv fragment using site-directed randomization by degenerate oligonucleotide PCR and subsequent phage display of the resultant scFv hybrids. Barbas (Barbas et al, 1992) attempted to circumvent the problem of limited repertoire sizes resulting from using biased variable region sequences by randomizing the sequence in a synthetic CDR region of a human tetanus toxoid-binding Fab.

Displayed peptide/polynucleotide complexes (library members) which encode a variable segment peptide sequence of interest or a single-chain antibody of interest are selected from the library by an affinity enrichment technique. This is accomplished by means of an immobilized macromolecule or epitope specific for the peptide sequence of interest, such as a receptor, other macromolecule, or other epitope species. Repeating the affinity selection procedure provides an enrichment of library members encoding the desired sequences, which may then be isolated for pooling and shuffling, for sequencing, and/or for further propagation and affinity enrichment.

The library members without the desired specificity are removed by washing. The degree and stringency of washing required will be determined for each peptide sequence or single-chain antibody of interest and the immobilized predetermined macromolecule or epitope. A certain degree of control can be exerted over the binding characteristics of the nascent peptide/DNA complexes recovered by adjusting the conditions of the binding incubation and the subsequent washing. The temperature, pH, ionic strength, divalent cation concentration, and the volume and duration of the washing will select for nascent peptide/DNA complexes within particular ranges of affinity for the immobilized macromolecule. Selection based on slow dissociation rate, which is usually predictive of high affinity, is often the most practical route. This may be done either by continued incubation in the presence of a saturating amount of free predetermined macromolecule, or by increasing the volume, number, and length of the washes. In each case, the rebinding of dissociated nascent peptide/DNA or peptide/RNA complex is prevented, and with increasing time, nascent peptide/DNA or peptide/RNA complexes of higher and higher affinity are recovered.
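
The link between wash duration and the affinity of the recovered complexes can be made explicit with a simple worked relation. It assumes first-order dissociation with a rate constant k_off and, as stated above, no rebinding of dissociated complexes; the symbols and the numerical values are illustrative assumptions rather than measured parameters.

```latex
% Fraction of a complex still bound after a wash of duration t
f(t) = e^{-k_{\mathrm{off}}\, t}

% Relative enrichment of complex 1 over complex 2 after the same wash
\frac{f_1(t)}{f_2(t)} = e^{\left(k_{\mathrm{off},2} - k_{\mathrm{off},1}\right) t}

% Hypothetical example: k_{off,1} = 10^{-4}\,\mathrm{s^{-1}},
% k_{off,2} = 10^{-2}\,\mathrm{s^{-1}}, t = 600\,\mathrm{s}
f_1 = e^{-0.06} \approx 0.94, \qquad f_2 = e^{-6} \approx 0.0025
```

Under these assumptions a ten-minute wash retains most of the slowly dissociating complex while removing nearly all of the faster one, and lengthening the wash increases the ratio exponentially.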

Additional modifications of the binding and washing procedures may be applied to find peptides with special characteristics. The affinities of some peptides are dependent on ionic strength or cation concentration. This is a useful characteristic for peptides that will be used in affinity purification of various proteins when gentle conditions for removing the protein from the peptides are required.

One variation involves the use of multiple binding targets (multiple epitope species, multiple receptor species), such that a scFv library can be simultaneously screened for a multiplicity of scFv which have different binding specificities. Given that the size of a scFv library often limits the diversity of potential scFv sequences, it is typically desirable to use scFv libraries of as large a size as possible. The time and economic considerations of generating a number of very large polysome scFv-display libraries can become prohibitive. To avoid this substantial problem, multiple predetermined epitope species (receptor species) can be concomitantly screened in a single library, or sequential screening against a number of epitope species can be used. In one variation, multiple target epitope species, each encoded on a separate bead (or subset of beads), can be mixed and incubated with a polysome-display scFv library under suitable binding conditions. The collection of beads, comprising multiple epitope species, can then be used to isolate, by affinity selection, scFv library members. Generally, subsequent affinity screening rounds can include the same mixture of beads, subsets thereof, or beads containing only one or two individual epitope species. This approach affords efficient screening, and is compatible with laboratory automation, batch processing, and high throughput screening methods.

1.4.5. Expression Systems

The DNA expression constructs will typically include an expression control DNA sequence operably linked to the coding sequences, including naturally-associated or heterologous promoter regions. Preferably, the expression control sequences will be eukaryotic promoter systems in vectors capable of transforming or transfecting eukaryotic host cells. Once the vector has been incorporated into the appropriate host, the host is maintained under conditions suitable for high level expression of the nucleotide sequences, and the collection and purification of the mutant “engineered” antibodies.

The DNA sequences will be expressed in hosts after the sequences have been operably linked to an expression control sequence (i.e., positioned to ensure the transcription and translation of the structural gene). These expression vectors are typically replicable in the host organisms either as episomes or as an integral part of the host chromosomal DNA. Commonly, expression vectors will contain selection markers, e.g., tetracycline or neomycin, to permit detection of those cells transformed with the desired DNA sequences (see, e.g., U.S. Pat. No. 4,704,362, which is incorporated herein by reference).

In addition to eukaryotic microorganisms such as yeast, mammalian tissue cell culture may also be used to produce the polypeptides of the present invention (see Winnacker, 1987, which is incorporated herein by reference). Eukaryotic cells are actually preferred, because a number of suitable host cell lines capable of secreting intact immunoglobulins have been developed in the art, and include the CHO cell lines, various COS cell lines, HeLa cells, and myeloma cell lines, but preferably transformed B-cells or hybridomas. Expression vectors for these cells can include expression control sequences, such as an origin of replication, a promoter, an enhancer (Queen et al, 1986), and necessary processing information sites, such as ribosome binding sites, RNA splice sites, polyadenylation sites, and transcriptional terminator sequences. Preferred expression control sequences are promoters derived from immunoglobulin genes, cytomegalovirus, SV40, Adenovirus, Bovine Papilloma Virus, and the like.

Eukaryotic DNA transcription can be increased by inserting an enhancer sequence into the vector. Enhancers are cis-acting sequences of between 10 and 300 bp that increase transcription by a promoter. Enhancers can effectively increase transcription when either 5′ or 3′ to the transcription unit. They are also effective if located within an intron or within the coding sequence itself. Typically, viral enhancers are used, including SV40 enhancers, cytomegalovirus enhancers, polyoma enhancers, and adenovirus enhancers. Enhancer sequences from mammalian systems are also commonly used, such as the mouse immunoglobulin heavy chain enhancer.

Mammalian expression vector systems will also typically include a selectable marker gene. Examples of suitable markers include the dihydrofolate reductase gene (DHFR), the thymidine kinase gene (TK), or prokaryotic genes conferring drug resistance. The first two marker genes are preferably used with mutant cell lines that lack the ability to grow without the addition of thymidine to the growth medium. Transformed cells can then be identified by their ability to grow on non-supplemented media. Examples of prokaryotic drug resistance genes useful as markers include genes conferring resistance to G418, mycophenolic acid and hygromycin.

The vectors containing the DNA segments of interest can be transferred into the host cell by well-known methods, depending on the type of cellular host. For example, calcium chloride transfection is commonly utilized for prokaryotic cells, whereas calcium phosphate treatment, lipofection, or electroporation may be used for other cellular hosts. Other methods used to transform mammalian cells include the use of Polybrene, protoplast fusion, liposomes, electroporation, and micro-injection (see, generally, Sambrook et al, 1982 and 1989).

Once expressed, the antibodies, individual mutated immunoglobulin chains, mutated antibody fragments, and other immunoglobulin polypeptides of the invention can be purified according to standard procedures of the art, including ammonium sulfate precipitation, fraction column chromatography, gel electrophoresis and the like (see, generally, Scopes, 1982). Once purified, partially or to homogeneity as desired, the polypeptides may then be used therapeutically or in developing and performing assay procedures, immunofluorescent stainings, and the like (see, generally, Lefkovits and Pernis, 1979 and 1981; Lefkovits, 1997).

1.4.6 Two-Hybrid Based Screening Assays

This invention provides for screening with a two-hybrid screening system to identify library members which bind a predetermined polypeptide sequence. The selected library members are pooled and shuffled by in vitro and/or in vivo recombination. The shuffled pool can then be screened in a yeast two-hybrid system to select library members which bind said predetermined polypeptide sequence (e.g., an SH2 domain) or which bind an alternate predetermined polypeptide sequence (e.g., an SH2 domain from another protein species).

An approach to identifying polypeptide sequences which bind to a predetermined polypeptide sequence has been to use a so-called “two-hybrid” system wherein the predetermined polypeptide sequence is present in a fusion protein (Chien et al, 1991). This approach identifies protein-protein interactions in vivo through reconstitution of a transcriptional activator (Fields and Song, 1989), the yeast Gal4 transcription protein. Typically, the method is based on the properties of the yeast Gal4 protein, which consists of separable domains responsible for DNA-binding and transcriptional activation. Polynucleotides encoding two hybrid proteins, one consisting of the yeast Gal4 DNA-binding domain fused to a polypeptide sequence of a known protein and the other consisting of the Gal4 activation domain fused to a polypeptide sequence of a second protein, are constructed and introduced into a yeast host cell. Intermolecular binding between the two fusion proteins reconstitutes the Gal4 DNA-binding domain with the Gal4 activation domain, which leads to the transcriptional activation of a reporter gene (e.g., lacZ, HIS3) which is operably linked to a Gal4 binding site. Typically, the two-hybrid method is used to identify novel polypeptide sequences which interact with a known protein (Silver and Hunt, 1993; Durfee et al, 1993; Yang et al, 1992; Luban et al, 1993; Hardy et al, 1992; Bartel et al, 1993; and Vojtek et al, 1993). However, variations of the two-hybrid method have been used to identify mutations of a known protein that affect its binding to a second known protein (Li and Fields, 1993; Lalo et al, 1993; Jackson et al, 1993; and Madura et al, 1993). Two-hybrid systems have also been used to identify interacting structural domains of two known proteins (Bardwell et al, 1993; Chakrabarty et al, 1992; Staudinger et al, 1993; and Milne and Weaver 1993) or domains responsible for oligomerization of a single protein (Iwabuchi et al, 1993; Bogerd et al, 1993). Variations of two-hybrid systems have been used to study the in vivo activity of a proteolytic enzyme (Dasmahapatra et al, 1992). Alternatively, an E. coli/BCCP interactive screening system (Germino et al, 1993; Guarente, 1993) can be used to identify interacting protein sequences (i.e., protein sequences which heterodimerize or form higher order heteromultimers). Sequences selected by a two-hybrid system can be pooled and shuffled and introduced into a two-hybrid system for one or more subsequent rounds of screening to identify polypeptide sequences which bind to the hybrid containing the predetermined binding sequence. The sequences thus identified can be compared to identify consensus sequence(s) and consensus sequence kernels.
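The comparison of selected sequences to find a consensus can be illustrated with a minimal sketch. It assumes the selected polypeptide sequences have already been aligned to equal length; the function name `consensus`, the `min_fraction` threshold, and the example sequences are hypothetical and are not taken from this disclosure.

```python
from collections import Counter


def consensus(seqs, min_fraction=0.5):
    """Per-position consensus of pre-aligned, equal-length sequences.

    A position is reported as 'x' when its most common residue occurs in
    fewer than min_fraction of the sequences (assumed threshold).
    """
    if not seqs:
        raise ValueError("need at least one sequence")
    length = len(seqs[0])
    if any(len(s) != length for s in seqs):
        raise ValueError("sequences must be pre-aligned to equal length")
    out = []
    for column in zip(*seqs):
        # Most frequent residue in this alignment column.
        residue, count = Counter(column).most_common(1)[0]
        out.append(residue if count / len(seqs) >= min_fraction else "x")
    return "".join(out)


# Made-up sequences standing in for binders recovered from a screen.
selected = ["YEEIP", "YEEIA", "YDEIP", "YEEVP"]
print(consensus(selected))
```

Runs of non-'x' positions in the result are one simple way to read off a candidate consensus kernel; a real analysis would normally start from a proper multiple sequence alignment.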

1.4.7. Improved Methods for Cellular Engineering, Protein Expression Profiling, Differential Labeling of Peptides, and Novel Reagents Therefor

In one embodiment, this invention relates to peptide chemistry, proteomics, and mass spectrometry technology. In particular, the invention provides novel methods for determining polypeptide profiles and protein expression variations, as with proteome analyses. The present invention provides methods of simultaneously identifying and quantifying individual proteins in complex protein mixtures by selective differential labeling of amino acid residues followed by chromatographic and mass spectrographic analysis.

The diagnosis and treatment of, as well as the determination of predisposition to, a variety of diseases and disorders may often be accomplished through identification and quantitative measurement of polypeptide expression variations between different cell types and cell states. Biochemical pathways and metabolic networks can also be analyzed by globally and quantitatively measuring protein expression in various cell types and biological states (see, e.g., Ideker (2001) Science 292: 929-934).

State-of-the-art techniques such as liquid-chromatography-electrospray-ionization tandem mass spectrometry have, in conjunction with database-searching computer algorithms, revolutionized the analysis of biochemical species from complex biological mixtures. With these techniques, it is now possible to perform high-throughput protein identification at picomolar to subpicomolar levels from complex mixtures of biological molecules (see, e.g., Dongre (1997) Trends Biotechnol. 15: 418-425).

One such method is based on a class of chemical reagents termed isotope-coded affinity tags (ICATs) and tandem mass spectrometry. The method labels multiple cysteinyl residues and uses stable isotope dilution techniques. For example, Gygi (1999) Nat. Biotechnol. 17(10): 994-999, compared protein expression in a yeast using ethanol or galactose as a carbon source. The measured differences in protein expression correlated with known yeast metabolic function under glucose-repressed conditions.

In another technique, two different protein mixtures for quantitative comparison are digested to peptide mixtures, the peptide mixtures are separately methylated using either d0- or d3-methanol, and the mixtures of methylated peptides are combined and subjected to microcapillary HPLC-MS/MS (see, e.g., Goodlett, D. R., et al., (2000) “Differential stable isotope labeling of peptides for quantitation and de novo sequence derivation,” 49th ASMS; Zhou, H; Watts, J D; Aebersold, R. A systematic approach to the analysis of protein phosphorylation. Comment In: Nat Biotechnol. 2001 April; 19(4): 317-8; Nature Biotechnology 2001 April, 19(4): 375-8). Parent proteins of methylated peptides are identified by correlative database searching of fragment ion spectra using computer-program-assisted paradigms or automated de novo sequencing that compares all tandem mass spectra of d0- and d3-methylated peptide ion pairs. In Goodlett (2000) supra, ratios of proteins in two different mixtures were calculated for d0- to d3-methylated peptide pairs. However, there are several limitations to this approach, including: the differential labeling reagents relied on stable isotopes, which are expensive and are not flexible enough for differential labeling of more than two mixtures of peptides; the labeling method was limited to methylation of carboxy-termini; protein expression profiling was limited to duplex comparison; and one-dimensional capillary HPLC chromatography was employed to separate the peptides, which does not have enough capacity and resolving power for complex mixtures of peptides.

In one embodiment this invention provides a method for identifying proteins by differential labeling of peptides, the method comprising the following steps: (a) providing a sample comprising a polypeptide; (b) providing a plurality of labeling reagents which differ in molecular mass that can generate differentially labeled peptides that do not differ in chromatographic retention properties and do not differ in ionization and detection properties in mass spectrographic analysis, wherein the differences in molecular mass are distinguishable by mass spectrographic analysis; (c) fragmenting the polypeptide into peptide fragments by enzymatic digestion or by non-enzymatic fragmentation; (d) contacting the labeling reagents of step (b) with the peptide fragments of step (c), thereby labeling the peptides with the differential labeling reagents; (e) separating the peptides by chromatography to generate an eluate; (f) feeding the eluate of step (e) into a mass spectrometer and quantifying the amount of each peptide and generating the sequence of each peptide by use of the mass spectrometer; (g) inputting the sequence to a computer program product which compares the inputted sequence to a database of polypeptide sequences to identify the polypeptide from which the sequenced peptide originated.

In one aspect, the sample of step (a) comprises a cell or a cell extract. The method can further comprise providing two or more samples comprising a polypeptide. One or more of the samples can be derived from a wild type cell and one sample can be derived from an abnormal or a modified cell. The abnormal cell can be a cancer cell. The modified cell can be a cell that is mutagenized &/or treated with a chemical, a physiological factor, or the presence of another organism (including, e.g. a eukaryotic organism, prokaryotic organism, virus, vector, prion, or part thereof), &/or exposed to an environmental factor or change or physical force (including, e.g., sound, light, heat, sonication, and radiation). The modification can be a genetic change (including, for example, a change in DNA or RNA sequence or content) or otherwise. In one aspect, the method further comprises purifying or fractionating the polypeptide before the fragmenting of step (c). The method can further comprise purifying or fractionating the polypeptide before the labeling of step (d). The method can further comprise purifying or fractionating the labeled peptide before the chromatography of step (e). In alternative aspects, the purifying or fractionating comprises a method selected from the group consisting of size exclusion chromatography, HPLC, reverse phase HPLC and affinity purification. In one aspect, the method further comprises contacting the polypeptide with a labeling reagent of step (b) before the fragmenting of step (c).

In one aspect, the labeling reagent of step (b) comprises the general formulae selected from the group consisting of: Z^(A)OH and Z^(B)OH, to esterify peptide C-terminals and/or Glu and Asp side chains; Z^(A)NH₂ and Z^(B)NH₂, to form an amide bond with peptide C-terminals and/or Glu and Asp side chains; and Z^(A)CO₂H and Z^(B)CO₂H, to form an amide bond with peptide N-terminals and/or Lys and Arg side chains; wherein Z^(A) and Z^(B) independently of one another comprise the general formula R-Z¹-A¹-Z²-A²-Z³-A³-Z⁴-A⁴-; Z¹, Z², Z³, and Z⁴, independently of one another, are selected from the group consisting of nothing, O, OC(O), OC(S), OC(O)O, OC(O)NR, OC(S)NR, OSiRR¹, S, SC(O), SC(S), SS, S(O), S(O₂), NR, NRR¹⁺, C(O), C(O)O, C(S), C(S)O, C(O)S, C(O)NR, C(S)NR, SiRR¹, (Si(RR¹)O)_(n), SnRR¹, Sn(RR¹)O, BR(OR¹), BRR¹, B(OR)(OR¹), OBR(OR¹), OBRR¹, and OB(OR)(OR¹), and R and R¹ is an alkyl group; A¹, A², A³, and A⁴, independently of one another, are selected from the group consisting of nothing or (CRR¹)_(n), wherein R, R¹, independently from other R and R¹ in Z¹ to Z⁴ and independently from other R and R¹ in A¹ to A⁴, are selected from the group consisting of a hydrogen atom, a halogen atom and an alkyl group; “n” in Z¹ to Z⁴, independent of n in A¹ to A⁴, is an integer having a value selected from the group consisting of 0 to about 51; 0 to about 41; 0 to about 31; 0 to about 21; 0 to about 11 and 0 to about 6.

In one aspect, the alkyl group (see definition below) is selected from the group consisting of an alkenyl, an alkynyl and an aryl group. One or more C-C bonds from (CRR¹)_(n) can be replaced with a double or a triple bond; thus, in alternative aspects, an R or an R¹ group is deleted. The (CRR¹)_(n) can be selected from the group consisting of an o-arylene, an m-arylene and a p-arylene, wherein each group has none or up to 6 substituents. The (CRR¹)_(n) can be selected from the group consisting of a carbocyclic, a bicyclic and a tricyclic fragment, wherein the fragment has up to 8 atoms in the cycle with or without a heteroatom selected from the group consisting of an O atom, an N atom and an S atom.

In one aspect, two or more labeling reagents have the same structure but a different isotope composition. For example, in one aspect, Z^(A) has the same structure as Z^(B), while Z^(A) has a different isotope composition than Z^(B). In alternative aspects, the isotopes are boron-10 and boron-11; carbon-12 and carbon-13; nitrogen-14 and nitrogen-15; or sulfur-32 and sulfur-34. In one aspect, where the isotope with the lower mass is x and the isotope with the higher mass is y, and x and y are integers, x is greater than y.

In alternative aspects, x and y are between 1 and about 11, between 1 and about 21, between 1 and about 31, between 1 and about 41, or between 1 and about 51.

In one aspect, the labeling reagent of step (b) comprises the general formulae selected from the group consisting of: CD₃(CD₂)_(n)OH/CH₃(CH₂)_(n)OH, to esterify peptide C-terminals, where n=0, 1, 2 or y; CD₃(CD₂)_(n)NH₂/CH₃(CH₂)_(n)NH₂, to form an amide bond with peptide C-terminals, where n=0, 1, 2 or y; and, D(CD₂)_(n)CO₂H/H(CH₂)_(n)CO₂H, to form an amide bond with peptide N-terminals, where n=0, 1, 2 or y; wherein D is a deuterium atom, and y is an integer selected from the group consisting of about 51; about 41; about 31; about 21; about 11; about 6 and between about 5 and 51.
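As a worked illustration of why the deuterated and non-deuterated reagent pairs above are distinguishable by mass, the following sketch computes the mass shift per label from the number of hydrogens replaced by deuterium. The function names and the `kind` strings are hypothetical; the atomic masses are standard monoisotopic values rounded to eight decimals, and the sketch assumes only the alkyl (or acyl) chain carries deuterium.

```python
H_MASS = 1.00782503  # monoisotopic mass of 1H, in Da
D_MASS = 2.01410178  # monoisotopic mass of 2H (deuterium), in Da


def deuterium_count(kind, n):
    """Number of D atoms in the heavy reagent for chain parameter n.

    'alcohol' and 'amine' correspond to CD3(CD2)n-: 3 + 2n deuteriums.
    'acid' corresponds to D(CD2)n-CO2H: 1 + 2n deuteriums.
    """
    if n < 0:
        raise ValueError("n must be non-negative")
    if kind in ("alcohol", "amine"):
        return 3 + 2 * n
    if kind == "acid":
        return 1 + 2 * n
    raise ValueError(f"unknown reagent kind: {kind}")


def mass_shift(kind, n, labels_per_peptide=1):
    """Heavy-minus-light mass difference (Da) for a labeled peptide."""
    return labels_per_peptide * deuterium_count(kind, n) * (D_MASS - H_MASS)


# d3- versus d0-methanol (n = 0), one esterified carboxyl group.
print(round(mass_shift("alcohol", 0), 5))
```

For a peptide ion of charge z, the two members of the pair would be expected roughly `mass_shift(...) / z` apart on the m/z axis.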

In one aspect, the labeling reagent of step (b) can comprise the general formulae selected from the group consisting of Z^(A)OH and Z^(B)OH to esterify peptide C-terminals; Z^(A)NH₂/Z^(B)NH₂ to form an amide bond with peptide C-terminals; and, Z^(A)CO₂H/Z^(B)CO₂H to form an amide bond with peptide N-terminals; wherein Z^(A) and Z^(B) have the general formula R-Z¹-A¹-Z²-A²-Z³-A³-Z⁴-A⁴-; Z¹, Z², Z³, and Z⁴, independently of one another, are selected from the group consisting of nothing, O, OC(O), OC(S), OC(O)O, OC(O)NR, OC(S)NR, OSiRR¹, S, SC(O), SC(S), SS, S(O), S(O₂), NR, NRR¹⁺, C(O), C(O)O, C(S), C(S)O, C(O)S, C(O)NR, C(S)NR, SiRR¹, (Si(RR¹)O)_(n), SnRR¹, Sn(RR¹)O, BR(OR¹), BRR¹, B(OR)(OR¹), OBR(OR¹), OBRR¹, and OB(OR)(OR¹); A¹, A², A³, and A⁴, independently of one another, are selected from the group consisting of nothing and the general formulae (CRR¹)_(n); and R and R¹ is an alkyl group.

In one aspect, a single C-C bond in a (CRR¹)_(n) group is replaced with a double or a triple bond; thus, the R and R¹ can be absent. The (CRR¹)_(n) can comprise a moiety selected from the group consisting of an o-arylene, an m-arylene and a p-arylene, wherein the group has none or up to 6 substituents. The group can comprise a carbocyclic, a bicyclic, or a tricyclic fragment with up to 8 atoms in the cycle, with or without a heteroatom selected from the group consisting of an O atom, an N atom and an S atom. In one aspect, R, R¹, independently from other R and R¹ in Z¹-Z⁴ and independently from other R and R¹ in A¹-A⁴, are selected from the group consisting of a hydrogen atom, a halogen and an alkyl group. The alkyl group (see definition below) can be an alkenyl, an alkynyl or an aryl group.

In one aspect, the “n” in Z¹-Z⁴ is independent of n in A¹-A⁴ and is an integer selected from the group consisting of about 51; about 41; about 31; about 21; about 11 and about 6. In one aspect, Z^(A) has the same structure as Z^(B) but Z^(A) further comprises x number of -CH₂- fragment(s) in one or more A¹-A⁴ fragments, wherein x is an integer. In one aspect, Z^(A) has the same structure as Z^(B) but Z^(A) further comprises x number of -CF₂- fragment(s) in one or more A¹-A⁴ fragments, wherein x is an integer. In one aspect, Z^(A) comprises x number of protons and Z^(B) comprises y number of halogens in the place of protons, wherein x and y are integers. In one aspect, Z^(A) contains x number of protons and Z^(B) contains y number of halogens, and there are x−y number of protons remaining in one or more A¹-A⁴ fragments, wherein x and y are integers. In one aspect, Z^(A) further comprises x number of -O- fragment(s) in one or more A¹-A⁴ fragments, wherein x is an integer. In one aspect, Z^(A) further comprises x number of -S- fragment(s) in one or more A¹-A⁴ fragments, wherein x is an integer. In one aspect, Z^(A) further comprises x number of -O- fragment(s) and Z^(B) further comprises y number of -S- fragment(s) in the place of -O- fragment(s), wherein x and y are integers. In one aspect, Z^(A) further comprises x−y number of -O- fragment(s) in one or more A¹-A⁴ fragments, wherein x and y are integers.

In alternative aspects, x and y are integers selected from the group consisting of between 1 and about 51; between 1 and about 41; between 1 and about 31; between 1 and about 21; between 1 and about 11 and between 1 and about 6, wherein x is greater than y.

In one aspect, the labeling reagent of step (b) comprises the general formulae selected from the group consisting of: CH₃(CH₂)_(n)OH/CH₃(CH₂)_(n+m)OH, to esterify peptide C-terminals, where n=0, 1, 2, . . . , y; m=1, 2, . . . , y; CH₃(CH₂)_(n)NH₂/CH₃(CH₂)_(n+m)NH₂, to form an amide bond with peptide C-terminals, where n=0, 1, 2, . . . , y; m=1, 2, . . . , y; and, H(CH₂)_(n)CO₂H/H(CH₂)_(n+m)CO₂H, to form an amide bond with peptide N-terminals, where n=0, 1, 2, . . . , y; m=1, 2, . . . , y; wherein n, m and y are integers. In one aspect, n, m and y are integers selected from the group consisting of about 51; about 41; about 31; about 21; about 11; about 6 and between about 5 and 51.
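For the homologous (non-isotopic) reagent pairs above, the mass difference comes from the m additional methylene units. The sketch below, offered only as an illustration, computes the monoisotopic mass of an alcohol CH₃(CH₂)_(n)OH and the spacing between two homologs; the function names are hypothetical and the atomic masses are standard monoisotopic values rounded to eight decimals.

```python
C_MASS = 12.00000000  # monoisotopic mass of 12C, in Da
H_MASS = 1.00782503   # monoisotopic mass of 1H, in Da
O_MASS = 15.99491462  # monoisotopic mass of 16O, in Da

CH2_MASS = C_MASS + 2 * H_MASS  # one methylene unit


def alcohol_mass(n):
    """Monoisotopic mass of CH3(CH2)nOH, i.e. C(n+1) H(2n+4) O."""
    if n < 0:
        raise ValueError("n must be non-negative")
    return (n + 1) * C_MASS + (2 * n + 4) * H_MASS + O_MASS


def homolog_shift(m):
    """Mass difference (Da) between CH3(CH2)(n+m)OH and CH3(CH2)nOH."""
    if m < 1:
        raise ValueError("m must be at least 1")
    return m * CH2_MASS


# Methanol (n = 0) versus ethanol (n = 0, m = 1).
print(round(alcohol_mass(0), 5), round(homolog_shift(1), 5))
```

Because every choice of m gives a distinct shift, more than two samples could in principle each receive a different homolog, which is the multiplexing idea the text describes.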

In one aspect, the separating of step (e) comprises a liquid chromatography system, such as a multidimensional liquid chromatography or a capillary chromatography system. In one aspect, the mass spectrometer comprises a tandem mass spectrometry device. In one aspect, the method further comprises quantifying the amount of each polypeptide or each peptide.

The invention provides a method for defining the expressed proteins associated with a given cellular state, the method comprising the following steps: (a) providing a sample comprising a cell in the desired cellular state; (b) providing a plurality of labeling reagents which differ in molecular mass that can generate differentially labeled peptides that do not differ in chromatographic retention properties and do not differ in ionization and detection properties in mass spectrographic analysis, wherein the differences in molecular mass are distinguishable by mass spectrographic analysis; (c) fragmenting polypeptides derived from the cell into peptide fragments by enzymatic digestion or by non-enzymatic fragmentation; (d) contacting the labeling reagents of step (b) with the peptide fragments of step (c), thereby labeling the peptides with the differential labeling reagents; (e) separating the peptides by chromatography to generate an eluate; (f) feeding the eluate of step (e) into a mass spectrometer and quantifying the amount of each peptide and generating the sequence of each peptide by use of the mass spectrometer; (g) inputting the sequence to a computer program product which compares the inputted sequence to a database of polypeptide sequences to identify the polypeptide from which the sequenced peptide originated, thereby defining the expressed proteins associated with the cellular state.

The invention provides a method for quantifying changes in protein expression between at least two cellular states, the method comprising the following steps: (a) providing at least two samples comprising cells in a desired cellular state; (b) providing a plurality of labeling reagents which differ in molecular mass that can generate differentially labeled peptides that do not differ in chromatographic retention properties and do not differ in ionization and detection properties in mass spectrographic analysis, wherein the differences in molecular mass are distinguishable by mass spectrographic analysis; (c) fragmenting polypeptides derived from the cells into peptide fragments by enzymatic digestion or by non-enzymatic fragmentation; (d) contacting the labeling reagents of step (b) with the peptide fragments of step (c), thereby labeling the peptides with the differential labeling reagents, wherein the labels used in one sample are different from the labels used in other samples; (e) separating the peptides by chromatography to generate an eluate; (f) feeding the eluate of step (e) into a mass spectrometer and quantifying the amount of each peptide and generating the sequence of each peptide by use of the mass spectrometer; (g) inputting the sequence to a computer program product which identifies from which sample each peptide was derived, compares the inputted sequence to a database of polypeptide sequences to identify the polypeptide from which the sequenced peptide originated, and compares the amount of each polypeptide in each sample, thereby quantifying changes in protein expression between at least two cellular states.
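The comparison performed in step (g) can be pictured with a minimal sketch that pairs co-eluting peaks separated by the expected label mass difference and reports a heavy-to-light intensity ratio. This is only an illustration under stated assumptions: the `Peak` record, the function `pair_ratios`, the tolerances, and the example values are all hypothetical, and the sketch does not represent the computer program product of this disclosure.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Peak:
    mz: float         # observed mass-to-charge ratio
    charge: int       # ion charge state
    intensity: float  # signal intensity (arbitrary units)
    rt_min: float     # chromatographic retention time, minutes


def pair_ratios(peaks, label_shift, mz_tol=0.01, rt_tol=0.2):
    """Find light/heavy peak pairs and their intensity ratios.

    label_shift is the heavy-minus-light mass difference in Da; the
    tolerances are assumed values chosen only for this example.
    """
    pairs = []
    for light in peaks:
        if light.intensity <= 0:
            continue  # cannot form a ratio against an empty signal
        expected_mz = light.mz + label_shift / light.charge
        for heavy in peaks:
            if heavy.charge != light.charge:
                continue
            same_mass = abs(heavy.mz - expected_mz) <= mz_tol
            co_eluting = abs(heavy.rt_min - light.rt_min) <= rt_tol
            if same_mass and co_eluting:
                pairs.append((light, heavy, heavy.intensity / light.intensity))
    return pairs


# Made-up peaks: one d0/d3 pair at charge 2 plus an unrelated peak.
peaks = [
    Peak(500.25, 2, 1000.0, 30.10),
    Peak(500.25 + 3.01883 / 2, 2, 2500.0, 30.15),
    Peak(650.30, 1, 400.0, 12.00),
]
for light, heavy, ratio in pair_ratios(peaks, label_shift=3.01883):
    print(light.mz, heavy.mz, ratio)
```

The retention-time check reflects the requirement that the differently labeled peptides do not differ in chromatographic retention properties; without that property the pairing step would not be reliable.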

The invention provides a method for identifying proteins by differential labeling of peptides, the method comprising the following steps: (a) providing a sample comprising a polypeptide; (b) providing a plurality of labeling reagents which differ in molecular mass but do not differ in chromatographic retention properties and do not differ in ionization and detection properties in mass spectrographic analysis, wherein the differences in molecular mass are distinguishable by mass spectrographic analysis; (c) fragmenting the polypeptide into peptide fragments by enzymatic digestion or by non-enzymatic fragmentation; (d) contacting the labeling reagents of step (b) with the peptide fragments of step (c), thereby labeling the peptides with the differential labeling reagents; (e) separating the peptides by multidimensional liquid chromatography to generate an eluate; (f) feeding the eluate of step (e) into a tandem mass spectrometer and quantifying the amount of each peptide and generating the sequence of each peptide by use of the mass spectrometer; (g) inputting the sequence to a computer program product which compares the inputted sequence to a database of polypeptide sequences to identify the polypeptide from which the sequenced peptide originated.

The invention provides a chimeric labeling reagent comprising (a) a first domain comprising a biotin; and (b) a second domain comprising a reactive group capable of covalently binding to an amino acid, wherein the chimeric labeling reagent comprises at least one isotope. The isotope(s) can be in the first domain or the second domain. For example, the isotope(s) can be in the biotin.

In alternative aspects, the isotope can be a deuterium isotope, a boron-10 or boron-11 isotope, a carbon-12 or a carbon-13 isotope, a nitrogen-14 or a nitrogen-15 isotope, or, a sulfur-32 or a sulfur-34 isotope. The chimeric labeling reagent can comprise two or more isotopes. The reactive group of the chimeric labeling reagent that is capable of covalently binding to an amino acid can be a succinimide group, an isothiocyanate group or an isocyanate group. The reactive group can be capable of covalently binding to a lysine or a cysteine.

The chimeric labeling reagent can further comprise a linker moiety linking the biotin group and the reactive group. The linker moiety can comprise at least one isotope. In one aspect, the linker is a cleavable moiety that can be cleaved by, e.g., enzymatic digest or by reduction.

The invention provides a method of comparing relative protein concentrations in a sample comprising (a) providing a plurality of differential small molecule tags, wherein the small molecule tags are structurally identical but differ in their isotope composition, and the small molecules comprise reactive groups that covalently bind to cysteine or lysine residues or both; (b) providing at least two samples comprising polypeptides; (c) attaching covalently the differential small molecule tags to amino acids of the polypeptides; (d) determining the protein concentrations of each sample in a tandem mass spectrometer; and, (e) comparing relative protein concentrations of each sample. In one aspect, the sample comprises a complete or a fractionated cellular sample.

In one aspect of the method, the differential small molecule tags comprise a chimeric labeling reagent comprising (a) a first domain comprising a biotin; and, (b) a second domain comprising a reactive group capable of covalently binding to an amino acid, wherein the chimeric labeling reagent comprises at least one isotope. The isotope can be a deuterium isotope, a boron-10 or boron-11 isotope, a carbon-12 or a carbon-13 isotope, a nitrogen-14 or a nitrogen-15 isotope, or, a sulfur-32 or a sulfur-34 isotope. The chimeric labeling reagent can comprise two or more isotopes. The reactive group capable of covalently binding to an amino acid can be selected from the group consisting of a succinimide group, an isothiocyanate group and an isocyanate group.

The invention provides a method of comparing relative protein concentrations in a sample comprising (a) providing a plurality of differential small molecule tags, wherein the differential small molecule tags comprise a chimeric labeling reagent comprising (i) a first domain comprising a biotin; and, (ii) a second domain comprising a reactive group capable of covalently binding to an amino acid, wherein the chimeric labeling reagent comprises at least one isotope; (b) providing at least two samples comprising polypeptides; (c) attaching covalently the differential small molecule tags to amino acids of the polypeptides; (d) isolating the tagged polypeptides on a biotin-binding column by binding tagged polypeptides to the column, washing non-bound materials off the column, and eluting tagged polypeptides off the column; (e) determining the protein concentrations of each sample in a tandem mass spectrometer; and, (f) comparing relative protein concentrations of each sample.

The details of one or more embodiments of the invention are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the invention will be apparent from the description and drawings, and from the claims. All publications, patents and patent applications cited herein are hereby expressly incorporated by reference for all purposes.

The invention provides methods for simultaneously identifying individual proteins in complex mixtures of biological molecules and quantifying the expression levels of those proteins, e.g., proteome analyses. The methods compare two or more samples of proteins, one of which can be considered as the standard sample and all others can be considered as samples under investigation. The proteins in the standard and investigated samples are subjected separately to a series of chemical modifications, i.e., differential chemical labeling, and fragmentation, e.g., by proteolytic digestion and/or other enzymatic reactions or physical fragmenting methodologies. The chemical modifications can be done before, or after, or before and after fragmentation/digestion of the polypeptide into peptides.

Peptides derived from the standard and the investigated samples are labeled with chemical residues of different mass, but of similar properties, such that peptides with the same sequence from both samples are eluted together in the separation procedure and their ionization and detection properties regarding the mass spectrometry are very similar. Differential chemical labeling can be performed on reactive functional groups on some or all of the carboxy- and/or amino-termini of proteins and peptides and/or on selected amino acid side chains. A combination of chemical labeling, proteolytic digestion and other enzymatic reaction steps, physical fragmentation and/or fractionation can provide access to a variety of residues to generate different specifically labeled peptides to enhance the overall selectivity of the procedure.

The standard and the investigated samples are combined, subjected to multidimensional chromatographic separation, and analyzed by mass spectrometry methods. Mass spectrometry data is processed by special software, which allows for identification and quantification of peptides and proteins.

Depending on the complexity and composition of the protein samples, it may be desirable, or be necessary, to perform protein fractionation using such methods as size exclusion, ion exchange, reverse phase, or other methods of affinity purifications prior to one or more chemical modification steps, proteolytic digestion or other enzymatic reaction steps, or physical fragmentation steps.

The combined mixtures of peptides are first separated by a chromatography method, such as a multidimensional liquid chromatography system, before being fed into a coupled mass spectrometry device, such as a tandem mass spectrometry device. The combination of multidimensional liquid chromatography and tandem mass spectrometry can be called “LC-LC-MS/MS.” LC-LC-MS/MS was first developed by Link A. and Yates J. R., as described, e.g., by Link (1999) Nature Biotechnology 17: 676-682; Link (1999) Electrophoresis 18: 1314-1334; Washburn, M P; Wolters, D; Yates, J R, Nature Biotechnology 2001 March, 19(3): 242-7.

In practicing the methods of the invention, proteins can be first substantially or partially isolated from the biological samples of interest. The polypeptides can be treated before selective differential labeling; for example, they can be denatured or reduced, preparations can be desalted, and the like. Conversion of samples of proteins into mixtures of differentially labeled peptides can include preliminary chemical and/or enzymatic modification of side groups and/or termini; proteolytic digestion or fragmentation; and post-digestion or post-fragmentation chemical and/or enzymatic modification of side groups and/or termini.

The differentially modified polypeptides and peptides are then combined into one or more peptide mixtures. Solvent or other reagents can be removed, neutralized or diluted, if desired or necessary. The buffer can be modified, or, the peptides can be redissolved in one or more different buffers, such as a “MudPIT” (see below) loading buffer. The peptide mixture is then loaded onto a chromatography column, such as a liquid chromatography column, a 2D capillary column or a multidimensional chromatography column, to generate an eluate.

The eluate is fed into a mass spectrometer, such as a tandem mass spectrometer. In one aspect, an LC ESI MS and MS/MS analysis is complete. Finally, data output is processed by appropriate software using database searching and data analysis.

In practicing the methods of the invention, high yields of peptides can be generated for mass spectrograph analysis. Two or more samples can be differentially labeled by selective labeling of each sample. Peptide modifications, i.e., labeling, are stable. Reagents having differing masses or reactive groups can be chosen to maximize the number of reactive groups and differentially labeled samples, thus allowing for a multiplex analysis of samples, polypeptides and peptides. In one aspect, a “MudPIT” protocol is used for peptide analysis, as described herein. The methods of the invention can be fully automated and can essentially analyze every protein in a sample.

Definitions

Unless defined otherwise, all technical and scientific terms used herein have the meaning commonly understood by a person skilled in the art to which this invention belongs. As used herein, the following terms have the meanings ascribed to them unless specified otherwise.

As used herein, the term “alkyl” is used to refer to a genus of compounds including branched or unbranched, saturated or unsaturated, monovalent hydrocarbon radicals, including substituted derivatives and equivalents thereof. In one aspect, the hydrocarbons have from about 1 to about 100 carbons, about 1 to about 50 carbons, about 1 to about 30 carbons, about 1 to about 20 carbons, or about 1 to about 10 carbons. When the alkyl group has from about 1 to 6 carbon atoms, it is referred to as a “lower alkyl.” Suitable alkyl radicals include, e.g., structures containing one or more methylene, methine and/or methyne groups arranged in acyclic and/or cyclic forms. Branched structures have a branching motif similar to isopropyl, tert-butyl, isobutyl, 2-ethylpropyl, etc. As used herein, the term encompasses “substituted alkyls.” “Substituted alkyl” refers to alkyl as just described including one or more functional groups such as lower alkyl, aryl, acyl, halogen (i.e., alkylhalos, e.g., CF3), hydroxy, amino, alkoxy, alkylamino, acylamino, thioamido, acyloxy, aryloxy, arylamino, aryloxyalkyl, mercapto, thia, aza, oxo, both saturated and unsaturated cyclic hydrocarbons, heterocycles and the like. These groups may be attached to any carbon of the alkyl moiety. Additionally, these groups may be pendent from, or integral to, the alkyl chain.

The term “alkoxy” is used herein to refer to an -OR group, where R is a lower alkyl, substituted lower alkyl, aryl, substituted aryl, arylalkyl or substituted arylalkyl wherein the alkyl, aryl, substituted aryl, arylalkyl and substituted arylalkyl groups are as described herein. Suitable alkoxy radicals include, for example, methoxy, ethoxy, phenoxy, substituted phenoxy, benzyloxy, phenethyloxy, tert.-butoxy, etc.

The term “aryl” is used herein to refer to an aromatic substituent that may be a single aromatic ring or multiple aromatic rings which are fused together, linked covalently, or linked to a common group such as a methylene or ethylene moiety. The common linking group may also be a carbonyl as in benzophenone. The aromatic ring(s) may include phenyl, naphthyl, biphenyl, diphenylmethyl and benzophenone among others. The term “aryl” encompasses “arylalkyl.” “Substituted aryl” refers to aryl as just described including one or more functional groups such as lower alkyl, acyl, halogen, alkylhalos (e.g., CF3), hydroxy, amino, alkoxy, alkylamino, acylamino, acyloxy, phenoxy, mercapto and both saturated and unsaturated cyclic hydrocarbons which are fused to the aromatic ring(s), linked covalently or linked to a common group such as a methylene or ethylene moiety. The linking group may also be a carbonyl such as in cyclohexyl phenyl ketone. The term “substituted aryl” encompasses “substituted arylalkyl.”

The term “arylalkyl” is used herein to refer to a subset of “aryl” in which the aryl group is further attached to an alkyl group, as defined herein.

The term “biotin” as used herein refers to any natural or synthetic biotin or variant thereof, which are well known in the art; ligands for biotin, and ways to modify the affinity of biotin for a ligand, are also well known in the art; see, e.g., U.S. Pat. Nos. 6,242,610; 6,150,123; 6,096,508; 6,083,712; 6,022,688; 5,998,155; 5,487,975.

The phrase “labeling reagents which . . . do not differ in ionization and detection properties in mass spectrographic analysis” means that the amount and/or mass sequence of the labeling reagents can be detected using the same mass spectrographic conditions and detection devices.

The term “polypeptide” includes natural and synthetic polypeptides, or mimetics, which can be either entirely composed of synthetic, non-natural analogues of amino acids, or, they can be chimeric molecules of partly natural peptide amino acids and partly non-natural analogs of amino acids. The term “polypeptide” as used herein includes proteins and peptides of all sizes.

The term “sample” as used herein includes any polypeptide-containingsample, including samples from natural sources, or, entirely syntheticsamples.

The term “column” as used herein means any substrate surface, includingbeads, filaments, arrays, tubes and the like.

The phrase “do not differ in chromatographic retention properties” as used herein means that two compositions have substantially, but not necessarily exactly, the same retention properties in a chromatograph, such as a liquid chromatograph. For example, two compositions do not differ in chromatographic retention properties if they elute together, i.e., they elute in what a skilled artisan would consider the same elution fraction.

Differential Labeling of Peptides and Polypeptides

In practicing the methods of the invention, proteins and peptides are subjected to a series of chemical modifications, i.e., differential chemical labeling. The chemical modifications can be done before, or after, or before and after fragmentation/digestion of the polypeptide into peptides. Differential labeling reagents can differ in their isotope composition (i.e., isotopical reagents) or in their structural composition (i.e., homologous reagents), but only by a rather small fragment, which change does not alter the properties stated above; i.e., the labeling reagents differ in molecular mass but do not differ in chromatographic retention properties and do not differ in ionization and detection properties in mass spectrographic analysis, and the differences in molecular mass are distinguishable by mass spectrographic analysis.

In one aspect of the invention, mixtures of polypeptides and/or peptides coming from the “standard” protein sample and the “investigated” protein sample(s) are labeled separately with differential reagents, or, one sample is labeled and the other sample remains unlabeled. As noted above, these differential reagents differ in molecular mass, but do not differ in retention properties regarding the separation method used (e.g., chromatography), and the mass spectrometry methods used will not detect different ionization and detection properties. Thus, these differential reagents differ either in their isotope composition (i.e., they are isotopical reagents) or they differ structurally by a rather small fragment which change does not alter the properties stated above (i.e., they are homologous reagents).

Differential chemical labeling can include esterification of C-termini, amidation of C-termini and/or acylation of N-termini. Esterification targets C-termini of peptides and carboxylic acid groups in amino acid side chains. Amidation targets C-termini of peptides and carboxylic acid groups in amino acid side chains. Amidation may require protection of amine groups first. Acylation targets N-termini of peptides and amino and hydroxy groups in amino acid side chains. Acylation may require protection of carboxylic groups first.
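As a rough illustration of the reaction targets just listed, the following Python sketch counts the sites in a peptide sequence that each reaction could in principle modify. The function name `count_label_sites` and the assignment of carboxylic acid, amino, and hydroxy side chains to particular residues (Asp/Glu, Lys, Ser/Thr/Tyr) are illustrative assumptions and are not taken from the methods described herein.

```python
# Illustrative sketch only: the residue assignments below are assumptions.
CARBOXYL_SIDE_CHAINS = set("DE")   # Asp, Glu (carboxylic acid side chains)
AMINO_SIDE_CHAINS = set("K")       # Lys (amino side chain)
HYDROXY_SIDE_CHAINS = set("STY")   # Ser, Thr, Tyr (hydroxy side chains)


def count_label_sites(peptide: str) -> dict:
    """Count candidate sites per reaction for a one-letter peptide sequence."""
    seq = peptide.upper()
    if not seq:
        raise ValueError("peptide must not be empty")
    # Esterification and amidation: the C-terminus plus carboxylic side chains.
    carboxyl = 1 + sum(r in CARBOXYL_SIDE_CHAINS for r in seq)
    # Acylation: the N-terminus plus amino and hydroxy side chains.
    acyl = 1 + sum(
        r in AMINO_SIDE_CHAINS or r in HYDROXY_SIDE_CHAINS for r in seq
    )
    return {"esterification": carboxyl, "amidation": carboxyl, "acylation": acyl}


sites = count_label_sites("DAEKSTR")
print(sites)
```

Such a count gives only an upper bound on the number of labels per peptide; which sites actually react depends on the reaction conditions and on any protection steps used.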

The skilled artisan will recognize that the chemical syntheses and differential chemical labeling of peptides and polypeptides (e.g., esterification, amidation, and acylation) used to practice the methods of the invention can be carried out by a variety of procedures and methodologies, which are well described in the scientific and patent literature, e.g., Organic Syntheses Collective Volumes, Gilman et al. (Eds), John Wiley & Sons, Inc., NY; Venuti (1989) Pharm. Res. 6: 867-873; the Beilstein Handbook of Organic Chemistry (Beilstein Institut fuer Literatur der Organischen Chemie, Frankfurt, Germany); Beilstein online database and references obtainable therein; “Organic Chemistry,” Morrison & Boyd, 7th edition, 1999, Prentice-Hall, Upper Saddle River, N.J. The invention can be practiced in conjunction with any method or protocol known in the art, which are well described in the scientific and patent literature. For example, the esterification, amidation, and acylation reactions may be performed on the mixtures of peptides in a fashion similar to other reactions of these types already described in the prior art, such as:

In alternative aspects, reagents comprise the general formulae:

-   -   i. Z^(A)OH and Z^(B)OH to esterify peptide C-terminals and/or Glu and Asp side chains;
    -   ii. Z^(A)NH₂/Z^(B)NH₂ to form amide bond with peptide C-terminals and/or Glu and Asp side chains; or
    -   iii. Z^(A)CO₂H/Z^(B)CO₂H to form amide bond with peptide N-terminals and/or Lys and Arg side chains;
    -   wherein Z^(A) and Z^(B) independently of one another can be R-Z¹-A¹-Z²-A²-Z³-A³-Z⁴-A⁴-, and Z¹, Z², Z³, and Z⁴ independently of one another can be selected from O, OC(O), OC(S), OC(O)O, OC(O)NR, OC(S)NR, OSiRR¹, S, SC(O), SC(S), SS, S(O), S(O₂), NR, NRR¹⁺, C(O), C(O)O, C(S), C(S)O, C(O)S, C(O)NR, C(S)NR, SiRR¹, (Si(RR¹)O)_(n), SnRR¹, Sn(RR¹)O, BR(OR¹), BRR¹, B(OR)(OR¹), OBR(OR¹), OBRR¹, OB(OR)(OR¹), or, Z¹, Z², Z³, and Z⁴ independently of one another may be absent, and R is an alkyl group; and, A¹, A², A³, and A⁴ independently of one another can be selected from (CRR¹)_(n), and R is an alkyl group.
    -   In alternative aspects, some single C-C bonds from (CRR¹)_(n) may be replaced with double or triple bonds, in which case some groups R and R¹ will be absent, or (CRR¹)_(n) can be an o-arylene, an m-arylene, or a p-arylene with up to 6 substituents, carbocyclic, bicyclic, or tricyclic fragments with up to 8 atoms in the cycle with or without heteroatoms (O, N, S) and with or without substituents, or A¹, A², A³, and A⁴ independently of one another can be absent; R, R¹, independently from other R and R¹ in Z¹-Z⁴ and independently from other R and R¹ in A¹-A⁴, can be hydrogen, halogen or an alkyl group, such as an alkenyl, an alkynyl or an aryl group; n in Z¹-Z⁴, independent of n in A¹-A⁴, is an integer that can have value from 0 to about 51; 0 to about 41; 0 to about 31; 0 to about 21; 0 to about 11; 0 to about 6.
    -   In alternative aspects, Z^(A) has the same structure as Z^(B), but they have different isotope compositions. Any isotope may be used.
    -   In alternative aspects, if Z^(A) contains x number of protons, Z^(B) may contain y number of deuterons in the place of protons, and, correspondingly, x−y number of protons remaining; and/or if Z^(A) contains x number of borons-10, Z^(B) may contain y number of borons-11 in the place of borons-10, and, correspondingly, x−y number of borons-10 remaining; and/or if Z^(A) contains x number of carbons-12, Z^(B) may contain y number of carbons-13 in the place of carbons-12, and, correspondingly, x−y number of carbons-12 remaining; and/or if Z^(A) contains x number of nitrogens-14, Z^(B) may contain y number of nitrogens-15 in the place of nitrogens-14, and, correspondingly, x−y number of nitrogens-14 remaining; and/or if Z^(A) contains x number of sulfurs-32, Z^(B) may contain y number of sulfurs-34 in the place of sulfurs-32, and, correspondingly, x−y number of sulfurs-32 remaining; and so on for all elements which may be present and have different stable isotopes; x and y are whole numbers such that x is greater than y. In one aspect, x and y are between 1 and about 11, between 1 and about 21, between 1 and about 31, between 1 and about 41, between 1 and about 51.
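The isotope substitutions described above translate into a nominal mass difference between Z^(A) and Z^(B). The following Python sketch, offered only as an illustration, sums that difference for the isotope pairs named in the text. The function name `isotope_mass_shift`, the input layout, and the use of integer (nominal) rather than exact isotope masses are assumptions made for the example.

```python
# Nominal mass gained per atom when a light isotope is replaced by a heavy one.
# Integer mass numbers are used; exact isotope masses differ slightly.
NOMINAL_SHIFT = {
    ("1H", "2H"): 1,     # proton -> deuteron
    ("10B", "11B"): 1,
    ("12C", "13C"): 1,
    ("14N", "15N"): 1,
    ("32S", "34S"): 2,
}


def isotope_mass_shift(substitutions: dict) -> int:
    """Return the nominal mass difference between Z^(B) and Z^(A).

    substitutions maps an isotope pair to (x, y): x atoms of the light
    isotope in Z^(A), of which y are replaced by the heavy isotope in Z^(B).
    """
    total = 0
    for pair, (x, y) in substitutions.items():
        # The text requires whole numbers with x greater than y.
        if not 1 <= y < x:
            raise ValueError(f"expected 1 <= y < x for {pair}, got x={x}, y={y}")
        total += NOMINAL_SHIFT[pair] * y
    return total


# Hypothetical reagent pair: 6 of 9 protons deuterated, 2 of 4 carbons as 13C.
shift = isotope_mass_shift({("1H", "2H"): (9, 6), ("12C", "13C"): (4, 2)})
print(shift)
```

A nominal shift computed this way is useful only for choosing reagent pairs whose labeled peptides will be resolved from one another; accurate peak assignment would use exact isotope masses.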

In alternative aspects, reagent pairs/series comprise the general formulae:

-   -   i. CD₃(CD₂)_(n)OH/CH₃(CH₂)_(n)OH to esterify peptide C-terminals, where n=0, 1, 2, . . . , y (delta mass=3+2n);
    -   ii. CD₃(CD₂)_(n)NH₂/CH₃(CH₂)_(n)NH₂ to form amide bond with peptide C-terminals, where n=0, 1, 2, . . . , y (delta mass=3+2n);
    -   iii. D(CD₂)_(n)CO₂H/H(CH₂)_(n)CO₂H to form amide bond with peptide N-terminals, where n=0, 1, 2, . . . , y (delta mass=1+2n);
    -   wherein y is an integer that can have value of about 51; about 41; about 31; about 21; about 11; about 6, or between about 5 and 51.
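The delta mass expressions given for the three deuterated reagent series can be tabulated directly. The short Python sketch below does so; the function name `isotopic_pair_delta_mass` and the reagent labels "alcohol", "amine", and "acid" are hypothetical names chosen for the example, and the values are nominal (integer) mass differences per label.

```python
def isotopic_pair_delta_mass(reagent: str, n: int) -> int:
    """Nominal mass difference between the deuterated and protiated reagent."""
    if n < 0:
        raise ValueError("n must be a non-negative integer")
    if reagent in ("alcohol", "amine"):
        # CD3(CD2)n- versus CH3(CH2)n-: three methyl D plus two D per CD2.
        return 3 + 2 * n
    if reagent == "acid":
        # D(CD2)n- versus H(CH2)n-: one terminal D plus two D per CD2.
        return 1 + 2 * n
    raise ValueError(f"unknown reagent class: {reagent}")


# Tabulate the first few members of each series.
for reagent in ("alcohol", "amine", "acid"):
    row = [isotopic_pair_delta_mass(reagent, n) for n in range(4)]
    print(reagent, row)
```

For a peptide carrying several labels, the observed mass difference between the two forms would be the per-label value multiplied by the number of labeled sites.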

Other exemplary reagents can be presented by general formulae:

-   -   i. Z^(A)OH and Z^(B)OH to esterify peptide C-terminals;
    -   ii. Z^(A)NH₂/Z^(B)NH₂ to form an amide bond with peptide C-terminals;
    -   iii. Z^(A)CO₂H/Z^(B)CO₂H to form an amide bond with peptide N-terminals;
    -   wherein Z^(A) and Z^(B) can be R-Z¹-A¹-Z²-A²-Z³-A³-Z⁴-A⁴-, and Z¹, Z², Z³, and Z⁴, independently of one another, can be selected from O, OC(O), OC(S), OC(O)O, OC(O)NR, OC(S)NR, OSiRR¹, S, SC(O), SC(S), SS, S(O), S(O₂), NR, NRR¹⁺, C(O), C(O)O, C(S), C(S)O, C(O)S, C(O)NR, C(S)NR, SiRR¹, (Si(RR¹)O)_(n), SnRR¹, Sn(RR¹)O, BR(OR¹), BRR¹, B(OR)(OR¹), OBR(OR¹), OBRR¹ or OB(OR)(OR¹); or, Z¹, Z², Z³, and Z⁴, independently of one another, can be absent, and, R is an alkyl group;
    -   A¹, A², A³, and A⁴, independently of one another, can be a moiety comprising the general formula (CRR¹)_(n). In alternative aspects, single C-C bonds in some (CRR¹)_(n) groups may be replaced with double or triple bonds, in which case some groups R and R¹ will be absent, or (CRR¹)_(n) can be an o-arylene, an m-arylene, or a p-arylene with up to 6 substituents, or a carbocyclic, a bicyclic, or a tricyclic fragment with up to 8 atoms in the cycle, with or without heteroatoms (e.g., O, N or S atoms), or, with or without substituents, or, A¹-A⁴ independently of one another may be absent;
    -   In alternative aspects, R, R¹, independently from other R and R¹ in Z¹-Z⁴ and independently from other R and R¹ in A¹-A⁴, can be a hydrogen atom, a halogen or an alkyl group, such as an alkenyl, an alkynyl or an aryl group;
    -   In alternative aspects, n in Z¹-Z⁴ is independent of n in A¹-A⁴ and is an integer that can have a value of about 51; about 41; about 31; about 21; about 11; about 6.

In alternative aspects, Z^(A) has a similar structure to that of Z^(B), but Z^(A) has x extra -CH₂- fragment(s) in one or more A¹-A⁴ fragments, and/or Z^(A) has x extra -CF₂- fragment(s) in one or more A¹-A⁴ fragments. Alternatively, Z^(A) can contain x number of protons and Z^(B) may contain y number of halogens in the place of protons. Alternatively, where Z^(A) contains x number of protons and Z^(B) contains y number of halogens, there are x−y number of protons remaining in one or more A¹-A⁴ fragments; and/or Z^(A) has x extra -O- fragment(s) in one or more A¹-A⁴ fragments; and/or Z^(A) has x extra -S- fragment(s) in one or more A¹-A⁴ fragments; and/or if Z^(A) contains x number of -O- fragment(s), Z^(B) may contain y number of -S- fragment(s) in the place of -O- fragment(s), and, correspondingly, x−y number of -O- fragment(s) remaining in one or more A¹-A⁴ fragments; and the like.

In alternative aspects, x and y are integers that can have a value of between 1 and about 51; of between 1 and about 41; of between 1 and about 31; of between 1 and about 21; of between 1 and about 11; of between 1 and about 6, such that x is greater than y.

Exemplary homologous reagent pairs/series are:

-   -   i. CH₃(CH₂)_(n)OH/CH₃(CH₂)_(n+m)OH to esterify peptide C-terminals, where n=0, 1, 2, . . . , y; m=1, 2, . . . , y (delta mass=14m);
    -   ii. CH₃(CH₂)_(n)NH₂/CH₃(CH₂)_(n+m)NH₂ to form amide bond with peptide C-terminals, where n=0, 1, 2, . . . , y; m=1, 2, . . . , y (delta mass=14m);
    -   iii. H(CH₂)_(n)CO₂H/H(CH₂)_(n+m)CO₂H to form amide bond with peptide N-terminals, where n=0, 1, 2, . . . , y; m=1, 2, . . . , y (delta mass=14m);
    -   wherein y is an integer that can have value of about 51; about 41; about 31; about 21; about 11; about 6, or between about 5 and 51.

Methods for Peptide/Protein Separation and Detection

The methods of the invention use chromatographic techniques to separate tagged polypeptides and peptides. In one aspect, liquid chromatography is used, e.g., multidimensional liquid chromatography. The chromatogram eluate is coupled to a mass spectrometer, such as a tandem mass spectrometry device (e.g., a “LC-LC-MS/MS” system). Any variation and equivalent thereof can be used to separate and detect peptides. LC-LC-MS/MS was first developed by Link A. and Yates J. R., as described, e.g., in Link (1999) Nature Biotechnology 17: 676-682; Link (1997) Electrophoresis 18: 1314-1334. In one aspect, the LC-LC-MS/MS technique is used; it is effective for separating complex peptide mixtures and it is easily automated. LC-LC-MS/MS is commonly known by the acronym “MudPIT,” for “Multi-dimensional Protein Identification Technique.”

Variations and equivalents of LC-LC-MS/MS used in the methods of the invention include methodologies involving reversed phase columns coupled to either cation exchange columns (as described, e.g., by Opiteck (1997) Anal. Chem. 69: 1518-1524) or size exclusion columns (as described, e.g., by Opiteck (1997) Anal. Biochem. 258: 349-361). In one aspect, an LC-LC-MS/MS technique uses a mixed bed microcapillary column containing strong cation exchange (SCX) and reversed phase (RPC) resins. Other exemplary alternatives include protein fractionation combined with one-dimensional LC-ESI MS/MS or peptide fractionation combined with MALDI MS/MS.

Depending on the complexity or the property of the protein samples, any protein fractionation method, including size exclusion chromatography, ion exchange chromatography, reverse phase chromatography, or any of the possible affinity purifications, can be introduced prior to labeling and proteolysis. In some circumstances, use of several different methods may be necessary to identify all proteins or specific proteins in a sample.

Sequence Analysis and Quantification

Both quantity and sequence identity of the protein from which the modified peptide originated can be determined by a mass spectrometry device, such as a “multistage mass spectrometry” (MS). This can be achieved by the operation of the mass spectrometer in a dual mode in which it alternates in successive scans between measuring the relative quantities of peptides eluting from the capillary column and recording the sequence information of selected peptides. Peptides are quantified by measuring in the MS mode the relative signal intensities for pairs or series of peptide ions of identical sequence that are tagged differentially, which therefore differ in mass by the mass differential encoded within the differential labeling reagents.
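The relative quantification step described above can be illustrated with a minimal Python sketch that pairs peaks separated by the expected mass differential and reports their intensity ratio. The function name `pair_and_quantify`, the peak list layout, the charge handling, and the matching tolerance are all assumptions made for the example; they do not describe the software actually used with any particular instrument.

```python
from typing import List, Tuple

Peak = Tuple[float, float]  # (m/z, intensity)


def pair_and_quantify(
    peaks: List[Peak], delta_mass: float, charge: int = 1, tol: float = 0.02
) -> List[Tuple[float, float, float]]:
    """Pair light/heavy peaks and return (light m/z, heavy m/z, heavy/light ratio).

    delta_mass is the mass differential encoded by the labeling reagents;
    on the m/z axis the expected spacing is delta_mass / charge.
    """
    spacing = delta_mass / charge
    ordered = sorted(peaks)
    results = []
    for light_mz, light_int in ordered:
        if light_int <= 0:
            continue  # cannot form a ratio against an empty signal
        for heavy_mz, heavy_int in ordered:
            if abs((heavy_mz - light_mz) - spacing) <= tol:
                results.append((light_mz, heavy_mz, heavy_int / light_int))
    return results


# Hypothetical MS scan: one differentially labeled pair plus an unrelated peak.
scan = [(500.30, 2000.0), (503.30, 1000.0), (620.10, 500.0)]
pairs = pair_and_quantify(scan, delta_mass=3.0)
print(pairs)
```

In this hypothetical scan the heavy form is present at half the intensity of the light form, which would be read as a twofold difference in abundance between the two samples for that peptide.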

Peptide sequence information can be automatically generated by selecting peptide ions of a particular mass-to-charge (m/z) ratio for collision-induced dissociation (CID) in the mass spectrometer operating in the tandem MS mode, as described, e.g., by Link (1997) Electrophoresis 18: 1314-1334; Gygi (1999) Nature Biotechnol. 17: 994-999; Gygi (1999) Mol. Cell. Biol. 19: 1720-1730.

The resulting tandem mass spectra can be correlated to sequence databases to identify the protein from which the sequenced peptide originated. Exemplary commercially available software packages include TURBOSEQUEST™ by Thermo Finnigan, San Jose, Calif.; MASCOT™ by Matrix Science; and SONAR MS/MS™ by Proteometrics. Routine software modifications may be necessary for automated relative quantification.

Mass Spectrometry Devices

The methods of the invention use mass spectrometry to identify and quantify differentially labeled peptides and polypeptides. Any mass spectrometry system can be used. In one aspect of the invention, combined mixtures of peptides are separated by a chromatography method comprising multidimensional liquid chromatography coupled to tandem mass spectrometry, or, “LC-LC-MS/MS,” see, e.g., Link (1999) Nature Biotechnology 17: 676-682; Link (1997) Electrophoresis 18: 1314-1334. Exemplary mass spectrometry devices include those incorporating matrix-assisted laser desorption-ionization-time-of-flight (MALDI-TOF) mass spectrometry (see, e.g., Isola (2001) Anal. Chem. 73: 2126-2131; Van de Water (2000) Methods Mol. Biol. 146: 453-459; Griffin (2000) Trends Biotechnol. 18: 77-84; Ross (2000) Biotechniques 29: 620-626, 628-629). The inherent high molecular weight resolution of MALDI-TOF MS confers high specificity and a good signal-to-noise ratio for performing accurate quantitation.

Use of mass spectrometry, including MALDI-TOF MS, and its use in detecting nucleic acid hybridization and in nucleic acid sequencing, is well known in the art, see, e.g., U.S. Pat. Nos. 6,258,538; 6,238,871; 6,238,869; 6,235,478; 6,232,066; 6,228,654; 6,225,450; 6,051,378; 6,043,031.

Fragmentation and Proteolytic Digestion

In practicing the methods of the invention, polypeptides are fragmented, e.g., by proteolytic, i.e., enzymatic, digestion and/or other enzymatic reactions or physical fragmenting methodologies. The fragmentation can be done before and/or after reacting the peptides/polypeptides with the labeling reagents used in the methods of the invention.

Methods for proteolytic cleavage of polypeptides are well known in the art, e.g., enzymes include trypsin (see, e.g., U.S. Pat. Nos. 6,177,268; 4,973,554), chymotrypsin (see, e.g., U.S. Pat. Nos. 4,695,458; 5,252,463), elastase (see, e.g., U.S. Pat. No. 4,071,410); subtilisin (see, e.g., U.S. Pat. No. 5,837,516) and the like.
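To illustrate how a proteolytic digestion determines the peptides that are later labeled and detected, the following Python sketch performs an in silico digestion with trypsin. It relies on the commonly used simplification that trypsin cleaves on the C-terminal side of Lys (K) and Arg (R) except when the next residue is Pro (P); that rule, the function name `tryptic_digest`, and the example sequence are assumptions for illustration and are not drawn from the cited patents.

```python
from typing import List


def tryptic_digest(sequence: str) -> List[str]:
    """Split a one-letter protein sequence after K or R, unless followed by P."""
    seq = sequence.upper()
    peptides = []
    start = 0
    for i, residue in enumerate(seq):
        next_residue = seq[i + 1] if i + 1 < len(seq) else ""
        if residue in ("K", "R") and next_residue != "P":
            peptides.append(seq[start:i + 1])
            start = i + 1
    if start < len(seq):
        # Keep the C-terminal remainder that does not end in K or R.
        peptides.append(seq[start:])
    return peptides


# Hypothetical sequence: the R at position 7 is followed by P and is not cleaved.
fragments = tryptic_digest("MKWVTFRPSLLK")
print(fragments)
```

A real digestion also produces missed cleavages and nonspecific fragments, so a predicted peptide list of this kind is only a starting point for interpreting spectra.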

In one aspect, a chimeric labeling reagent of the invention includes a cleavable linker. Exemplary cleavable linker sequences include, e.g., Factor Xa or enterokinase (Invitrogen, San Diego Calif.). Other purification facilitating domains can be used, such as metal chelating peptides, e.g., polyhistidine tracts and histidine-tryptophan modules that allow purification on immobilized metals, protein A domains that allow purification on immobilized immunoglobulin, and the domain utilized in the FLAGS extension/affinity purification system (Immunex Corp, Seattle Wash.).

Biological Samples

The methods are based on comparison of two or more samples of proteins, one of which can be considered as the standard sample and all others can be considered as samples under investigation. For example, in one aspect, the invention provides a method for quantifying changes in protein expression between at least two cellular states, such as, an activated cell versus a resting cell, a normal cell versus a cancerous cell, a stem cell versus a differentiated cell, an injured cell or infected cell versus an uninjured cell or uninfected cell; or, for defining the expressed proteins associated with a given cellular state.

Samples can be derived from any biological source, including cells from, e.g., bacteria, insects, yeast, mammals and the like. Cells can be harvested from any body fluid or tissue source, or, they can be in vitro cell lines or cell cultures.

Detection Devices and Methods

The devices and methods of the invention can also incorporate in whole or in part designs of detection devices as described, e.g., in U.S. Pat. Nos. 6,197,503; 6,197,498; 6,150,147; 6,083,763; 6,066,448; 6,045,996; 6,025,601; 5,599,695; 5,981,956; 5,698,089; 5,578,832; 5,632,957.

A number of embodiments of the invention have been described. Nevertheless, it will be understood that various modifications may be made without departing from the spirit and scope of the invention.

REFERENCES

Unless otherwise indicated, all references cited herein (supra and infra) are incorporated by reference in their entirety.

-   Gygi S P, Rist B, Gerber S A, Turecek F, Gelb M H, Aebersold R.: Quantitative analysis of complex protein mixtures using isotope-coded affinity tags. Nat Biotechnol 17(10): 994-9 (October) 1999.
-   Hopkins M J, Sharp R, Macfarlane G T.: Age and disease related changes in intestinal bacterial populations assessed by cell culture, 16S rRNA abundance, and community cellular fatty acid profiles. Gut 48(2): 198-205 (February) 2001.
-   Ritchie N J, Schutter M E, Dick R P, Myrold D D.: Use of length heterogeneity PCR and fatty acid methyl ester profiles to characterize microbial communities in soil. Appl Environ Microbiol 66(4): 1668-75 (April) 2000.
-   Khan A A, Wang R F, Cao W W, Franklin W, Cerniglia C E.: Reclassification of a polycyclic aromatic hydrocarbon-metabolizing bacterium, Beijerinckia sp. strain B1, as Sphingomonas yanoikuyae by fatty acid analysis, protein pattern analysis, DNA-DNA hybridization, and 16S ribosomal DNA sequencing. Int J Syst Bacteriol 46(2): 466-9 (April) 1996.
-   Peltroche-Llacsahuanga H, Schmidt S, Lutticken R, Haase G.: Discriminative power of fatty acid methyl ester (FAME) analysis using the microbial identification system (MIS) for Candida (Torulopsis) glabrata and Saccharomyces cerevisiae. Diagn Microbiol Infect Dis 38(4): 213-21 (December) 2000.
-   S A Gerber et al.: Analysis of rates of multiple enzymes in cell lysates by electrospray ionization mass spectrometry. J. Am. Chem. Soc. 121: 1102-3 1999.
-   www.genomeweb.com David Goodlett discusses the latest in genomics - ICAT reagents. Written by: Marian Moser Jones, Dec. 20, 2000.
-   WO0011208; Filed Aug. 25, 1999, Published Mar. 2, 2000. Aebersold R H, Gelb M H, Gygi S P, Scott C R, Turecek F, Gerber S A, Rist B: Rapid quantitative analysis of proteins or protein function in complex mixtures.
-   WO9905221; Filed Jul. 27, 1998, Published Feb. 4, 1999. Cummins W J, West R M, Smith J A: Cyanine Dyes.
-   U.S. Pat. No. 4,876,350; Filed Dec. 16, 1987, Issued Oct. 24, 1989. McGarrity J, Tenud L: Process for the production of (+) biotin.
-   U.S. Pat. No. 5,776,723; Filed Feb. 8, 1996, Issued Jul. 7, 1998. Herold C D, O'Hagan M: Rapid detection of mycobacterium tuberculosis.
-   U.S. Pat. No. 6,136,173; Filed Jun. 24, 1996, Issued Oct. 24, 2000. Anderson N L, Anderson N G, Goodman J: Automated system for two-dimensional electrophoresis.
-   U.S. Pat. No. 6,127,134; Filed Apr. 20, 1995, Issued Oct. 3, 2000. Minden J, Waggoner A: Difference gel electrophoresis using matched multiple dyes.
-   U.S. Pat. No. 6,064,754; Filed Dec. 1, 1997, Issued May 16, 2000. Parekh R B, Amess R, Bruce J A, Prime S B, Platt A E, Stoney R M: Computer-assisted methods and apparatus for identification and characterization of biomolecules in a biological sample.
-   U.S. Pat. No. 6,013,165; Filed May 22, 1998, Issued Jan. 11, 2000. Wiktorowicz J E, Raysberg Y: Electrophoresis apparatus and method.
-   Ausubel F M, Brent R, Kingston R E, Moore D D, Seidman J G, Smith J A, Struhl K Editors. Current Protocols In Molecular Biology, Vol 2. John Wiley & Sons, Inc, © 2001, 10.21.4-10.21.6, 10.22.5-10.22.10, 10.22.14, 10.22.15-10.22.20.
-   Sambrook J, Russell D W Editors. Molecular Cloning A Laboratory Manual 3^(rd) ed. Cold Spring Harbor Laboratory Press, New York, © 2001, 18.3, 18.62, 18.66.
-   Alting-Mees M A and Short J M: Polycos vectors: a system for packaging filamentous phage and phagemid vectors using lambda phage packaging extracts. Gene 137: 1, 93-100, 1993.
-   Arkin A P and Youvan D C: An algorithm for protein engineering: simulations of recursive ensemble mutagenesis. Proc Natl Acad Sci USA 89(16): 7811-7815, (Aug. 15) 1992.
-   Arnold F H: Protein engineering for unusual environments. Current Opinion in Biotechnology 4(4): 450-455, 1993.
-   Ausubel F M, et al Editors. Current Protocols in Molecular Biology, Vols. 1 and 2 and supplements. (a.k.a. “The Red Book”) Greene Publishing Assoc., Brooklyn, N.Y., © 1987.
-   Ausubel F M, et al Editors. Current Protocols in Molecular Biology, Vols. 1 and 2 and supplements. (a.k.a. “The Red Book”) Greene Publishing Assoc., Brooklyn, N.Y., © 1989.
-   Ausubel F M, et al Editors. Short Protocols in Molecular Biology: A Compendium of Methods from Current Protocols in Molecular Biology. Greene Publishing Assoc., Brooklyn, N.Y., © 1989.
-   Ausubel F M, et al Editors. Short Protocols in Molecular Biology: A Compendium of Methods from Current Protocols in Molecular Biology, 2^(nd) Edition. Greene Publishing Assoc., Brooklyn, N.Y., © 1992.
-   Barbas C F 3d, Bain J D, Hoekstra D M, Lerner R A: Semisynthetic combinatorial antibody libraries: a chemical solution to the diversity problem. Proc Natl Acad Sci USA 89(10): 4457-4461, 1992.
-   Bardwell A J, Bardwell L, Johnson D K, Friedberg E C: Yeast DNA recombination and repair proteins Rad1 and Rad10 constitute a complex in vivo mediated by localized hydrophobic domains. Mol Microbiol 8(6): 1177-1188, 1993.
-   Barret A J, et al., eds.: Enzyme Nomenclature: Recommendations of the Nomenclature Committee of the International Union of Biochemistry and Molecular Biology. San Diego: Academic Press, Inc., 1992.
-   Bartel P, Chien C T, Sternglanz R, Fields S: Elimination of false positives that arise in using the two-hybrid system. Biotechniques 14(6): 920-924, 1993.
-   Beaudry A A and Joyce G F: Directed evolution of an RNA enzyme. Science 257(5070): 635-641, 1992.
-   Berger and Kimmel, Methods in Enzymology, Volume 152, Guide to Molecular Cloning Techniques. Academic Press, Inc., San Diego, Calif., © 1987. (Cumulative Subject Index: Volumes 135-139, 141-167, 1990, 272 pp.)
-   Bevan M: Binary Agrobacterium vectors for plant transformation. Nucleic Acids Research 12(22): 8711-21, 1984.
-   Biocca S, Pierandrei-Amaldi P, Cattaneo A: Intracellular expression of anti-p21ras single chain Fv fragments inhibits meiotic maturation of xenopus oocytes. Biochem Biophys Res Commun 197(2): 422-427, 1993.
-   Bird et al. Plant Mol Biol 11: 651, 1988.
-   Bogerd H P, Fridell R A, Blair W S, Cullen B R: Genetic evidence that the Tat proteins of human immunodeficiency virus types 1 and 2 can multimerize in the eukaryotic cell nucleus. J Virol 67(8): 5030-5034, 1993.
-   Boyce C O L, ed.: Novo's Handbook of Practical Biotechnology. 2^(nd) ed. Bagsvaerd, Denmark, 1986.
-   Brederode F T, Koper-Zwarthoff E C, Bol J F: Complete nucleotide sequence of alfalfa mosaic virus RNA 4. Nucleic Acids Research 8(10): 2213-23, 1980.
-   Breitling F, Dubel S, Seehaus T, Klewinghaus I, Little M: A surface expression vector for antibody screening. Gene 104(2): 147-153, 1991.
-   Brown N L, Smith M: Cleavage specificity of the restriction endonuclease isolated from Haemophilus gallinarum (Hga I). Proc Natl Acad Sci USA 74(8): 3213-6, (August) 1977.
-   Burton D R, Barbas C F 3d, Persson M A, Koenig S, Chanock R M, Lerner R A: A large array of human monoclonal antibodies to type I human immunodeficiency virus from combinatorial libraries of asymptomatic seropositive individuals. Proc Natl Acad Sci USA 88(22): 10134-7, (Nov. 15) 1991.
-   Caldwell R C and Joyce G F: Randomization of genes by PCR mutagenesis. PCR Methods Appl 2(10): 28-33, 1992.
-   Caton A J and Koprowski H: Influenza virus hemagglutinin-specific antibodies isolated from a combinatorial expression library are closely related to the immune response of the donor. Proc Natl Acad Sci USA 87(16): 6450-6454, 1990.
-   Chakraborty T, Martin J F, Olson E N: Analysis of the oligomerization of myogenin and E2A products in vivo using a two-hybrid assay system. J Biol Chem 267(25): 17498-501, 1992.
-   Chang C N, Landolfi N F, Queen C: Expression of antibody Fab domains on bacteriophage surfaces. Potential use for antibody selection. J Immunol 147(10): 3610-4, (Nov. 15) 1991.
-   Chaudhary V K, Batra J K, Gallo M G, Willingham M C, FitzGerald D J, Pastan I: A rapid method of cloning functional variable-region antibody genes in Escherichia coli as single-chain immunotoxins. Proc Natl Acad Sci USA 87(3): 1066-1070, 1990.
-   Chien C T, Bartel P L, Sternglanz R, Fields S: The two-hybrid system: a method to identify and clone genes for proteins that interact with a protein of interest. Proc Natl Acad Sci USA 88(21): 9578-9582, 1991.
-   Chiswell D J, McCafferty J: Phage antibodies: will new ‘coliclonal’ antibodies replace monoclonal antibodies? Trends Biotechnol 10(3): 80-84, 1992.
-   Chothia C and Lesk A M: Canonical structures for the hypervariable regions of immunoglobulins. J Mol Biol 196(4): 901-917, 1987.
-   Chothia C, Lesk A M, Tramontano A, Levitt M, Smith-Gill S J, Air G, Sheriff S, Padlan E A, Davies D, Tulip W R, et al: Conformations of immunoglobulin hypervariable regions. Nature 342(6252): 877-883, 1989.
-   Clackson T, Hoogenboom H R, Griffiths A D, Winter G: Making antibody fragments using phage display libraries. Nature 352(6336): 624-628, 1991.
-   Conrad M, Topal M D: DNA and spermidine provide a switch mechanism to regulate the activity of restriction enzyme Nae I. Proc Natl Acad Sci USA 86(24): 9707-11, (December) 1989.
-   Coruzzi G, Broglie R, Edwards C, Chua N H: Tissue-specific and light-regulated expression of a pea nuclear gene encoding the small subunit of ribulose-1,5-bisphosphate carboxylase. EMBO J. 3(8): 1671-9, 1984.
-   Dasmahapatra B, DiDomenico B, Dwyer S, Ma J, Sadowski I, Schwartz J: A genetic system for studying the activity of a proteolytic enzyme. Proc Natl Acad Sci USA 89(9): 4159-4162, 1992.
-   Davis L G, Dibner M D, Battey J F. Basic Methods in Molecular Biology. Elsevier, New York, N.Y., © 1986.
-   Delegrave S and Youvan D C. Biotechnology Research 11: 1548-1552, 1993.
-   DeLong E F, Wu K Y, Prezelin B B, Jovine R V: High abundance of Archaea in Antarctic marine picoplankton. Nature 371(6499): 695-697, 1994.
-   Deng S J, MacKenzie C R, Sadowska J, Michniewicz J, Young N M, Bundle D R, Narang S A: Selection of antibody single-chain variable fragments with improved carbohydrate binding by phage display. J Biol Chem 269(13): 9533-9538, 1994.
-   Drauz K, Waldman H, eds.: Enzyme Catalysis in Organic Synthesis: A Comprehensive Handbook. Vol. 1. New York: VCH Publishers, 1995.
-   Drauz K, Waldman H, eds.: Enzyme Catalysis in Organic Synthesis: A Comprehensive Handbook. Vol. 2. New York: VCH Publishers, 1995.
-   Duan L, Bagasra O, Laughlin M A, Oakes J W, Pomerantz R J: Potent inhibition of human immunodeficiency virus type I replication by an intracellular anti-Rev single-chain antibody. Proc Natl Acad Sci USA 91(11): 5075-5079, 1994.
-   Durfee T, Becherer K, Chen P L, Yeh S H, Yang Y, Kilburn A E, Lee W H, Elledge S J: The retinoblastoma protein associates with the protein phosphatase type I catalytic subunit. Genes Dev 7(4): 555-569, 1993.
-   Ellington A D and Szostak J W: In vitro selection of RNA molecules that bind specific ligands. Nature 346(6287): 818-822, 1990.
-   Fields S and Song O: A novel genetic system to detect protein-protein interactions. Nature 340(6230): 245-246, 1989.
-   Firek S, Draper J, Owen M R, Gandecha A, Cockburn B, Whitelam G C: Secretion of a functional single-chain Fv protein in transgenic tobacco plants and cell suspension cultures. Plant Mol Biol 23(4): 861-870, 1993.
-   Forsblom S, Rigler R, Ehrenberg M, Philipson L: Kinetic studies on the cleavage of adenovirus DNA by restriction endonuclease Eco RI. Nucleic Acids Res 3(12): 3255-69, (December) 1976.
-   Foster G D, Taylor S C, eds.: Plant Virology Protocols: From Virus Isolation to Transgenic Resistance. Methods in Molecular Biology, Vol. 81. N.J.: Humana Press Inc., 1998.
-   Franks F, ed.: Protein Biotechnology: Isolation, Characterization, and Stabilization. New Jersey: Humana Press Inc., 1993.

-   Germino F J, Wang Z X, Weissman S M: Screening for in vivo protein-protein interactions. Proc Natl Acad Sci USA 90(3): 933-937, 1993.

-   Gingeras T R, Brooks J E: Cloned restriction/modification system from Pseudomonas aeruginosa. Proc Natl Acad Sci USA 80(2): 402-6, 1983 (January).
-   Gluzman Y: SV40-transformed simian cells support the replication of early SV40 mutants. Cell 23(1): 175-182, 1981.
-   Godfrey T, West S, eds.: Industrial Enzymology. 2^(nd) ed. London: Macmillan Press Ltd, 1996.
-   Gottschalk G: Bacterial Metabolism. 2^(nd) ed. New York: Springer-Verlag Inc., 1986.
-   Gresshoff P M, ed.: Technology Transfer of Plant Biotechnology. Current Topics in Plant Molecular Biology. Boca Raton: CRC Press, 1997.
-   Griffin H G, Griffin A M, eds.: PCR Technology: Current Innovations. Boca Raton: CRC Press, Inc., 1994.
-   Gruber M, Schodin B A, Wilson E R, Kranz D M: Efficient tumor cell lysis mediated by a bispecific single chain antibody expressed in Escherichia coli. J Immunol 152(11): 5368-5374, 1994.
-   Guarente L: Strategies for the identification of interacting proteins. Proc Natl Acad Sci USA 90(5): 1639-1641, 1993.
-   Guilley H, Dudley R K, Jonard G, Balazs E, Richards K E: Transcription of Cauliflower mosaic virus DNA: detection of promoter sequences, and characterization of transcripts. Cell 30(3): 763-73, 1982.
-   Hansen G, Chilton M D: Lessons in gene transfer to plants by a gifted microbe. Curr Top Microbiol Immunol 240: 21-57, 1999.
-   Hardy C F, Sussel L, Shore D: A RAP1-interacting protein involved in transcriptional silencing and telomere length regulation. Genes Dev 6(5): 801-814, 1992.
-   Hartmann H T, et al.: Plant Propagation: Principles and Practices. 6^(th) ed. New Jersey: Prentice Hall, Inc., 1997.
-   Hawkins R E and Winter G: Cell selection strategies for making antibodies from variable gene libraries: trapping the memory pool. Eur J Immunol 22(3): 867-870, 1992.
-   Holvoet P, Laroche Y, Lijnen H R, Van Hoef B, Brouwers E, De Cock F, Lauwereys M, Gansemans Y, Collen D: Biochemical characterization of single-chain chimeric plasminogen activators consisting of a single-chain Fv fragment of a fibrin-specific antibody and single-chain urokinase. Eur J Biochem 210(3): 945-952, 1992.
-   Honjo T, Alt F W, Rabbitts T H (eds): Immunoglobulin genes. Academic Press: San Diego, Calif., pp. 361-368, © 1989.
-   Hoogenboom H R, Griffiths A D, Johnson K S, Chiswell D J, Judson P, Winter G: Multi-subunit proteins on the surface of filamentous phage: methodologies for displaying antibody (Fab) heavy and light chains. Nucleic Acids Res 19(15): 4133-4137, 1991.
-   Huse W D, Sastry L, Iverson S A, Kang A S, Alting-Mees M, Burton D R, Benkovic S J, Lerner R A: Generation of a large combinatorial library of the immunoglobulin repertoire in phage lambda. Science 246(4935): 1275-1281, 1989.
-   Huston J S, Levinson D, Mudgett-Hunter M, Tai M S, Novotney J, Margolies M N, Ridge R J, Bruccoleri R E, Haber E, Crea R, et al: Protein engineering of antibody binding sites: recovery of specific activity in an anti-digoxin single-chain Fv analogue produced in Escherichia coli. Proc Natl Acad Sci USA 85(16): 5879-5883, 1988.
-   Ivan Lefkovits, Editor. Immunology methods manual: the comprehensive sourcebook of techniques. Academic Press, San Diego, © 1997.
-   Iwabuchi K, Li B, Bartel P, Fields S: Use of the two-hybrid system to identify the domain of p53 involved in oligomerization. Oncogene 8(6): 1693-1696, 1993.

Jackson A L, Pahl P M, Harrison K, Rosamond J, Sclafani R A: Cell cycle regulation of the yeast Cdc7 protein kinase by association with the Dbf4 protein. Mol Cell Biol 13(5): 2899-2908, 1993.

-   Johnson S and Bird R E: Methods Enzymol 203: 88, 1991.-   Kabat et al: Sequences of Proteins of Immunological Interest, 4th    Ed. U.S. Department of Health and Human Services, Bethesda, Md.    (1987)-   Kang A S, Barbas C F, Janda K D, Benkovic S J, Lerner R A: Linkage    of recognition and replication functions by assembling combinatorial    antibody Fab libraries along phage surfaces. Proc Natl Acad Sci USA    88(10): 4363-4366, 1991.-   Kettleborough C A, Ansell K H, Allen R W, Rosell-Vives E, Gussow D    H, Bendig M M: Isolation of tumor cell-specific single-chain Fv from    immunized mice using phage-antibody libraries and the    re-construction of whole antibodies from these antibody fragments.    Eur J Immunol 24(4): 952-958, 1994.-   Kruger D H, Barcak G J, Reuter M, Smith H O: EcoRII can be activated    to cleave refractory DNA recognition sites. Nucleic Acids Res 16(9):    3997-4008, (May 11) 1988.-   Lalo D, Carles C, Sentenac A, Thuriaux P: Interactions between three    common subunits of yeast RNA polymerases I and III. Proc Natl Acad    Sci USA 90(12): 5524-5528, 1993.-   Laskowski M Sr: Purification and properties of venom    phosphodiesterase. Methods Enzymol 65(1): 276-84, 1980.-   Lefkovits I and Pemis B, Editors. Immunological Methods, Vols. I    and II. Academic Press, New York, N.Y. Also Vol. III published in    Orlando and Vol. IV published in San Diego. ©1979-.-   Lerner R A, Kang A S, Bain J D, Burton D R, Barbas C F 3d:    Antibodies without immunization. Science 258(5086): 1313-1314, 1992.-   Leung, D. W., et al, Technique, 1: 11-15, 1989.-   Li B and Fields S: Identification of mutations in p53 that affect    its binding to SV40 large T antigen by using the yeast two-hybrid    system. FASEB J 7(10): 957-963, 1993.-   Lilley G G, Doelzal 0, Hillyard C J, Bernard C, Hudson P J:    Recombinant single-chain antibody peptide conjugates expressed in    Escherichia coli for the rapid diagnosis of HIV. 
J Immunol Methods    171(2): 211-226, 1994.-   Lowman H B, Bass S H, Simpson N, Wells J A: Selecting high-affinity    binding proteins by monovalent phage display. Biochemistry 30(45):    10832-10838, 1991.-   Luban J, Bossolt K L, Franke E K, Kalpana G V, Goff S P: Human    immunodeficiency virus type I Gag protein binds to cyclophilins A    and B. Cell 73(6): 1067-1078, 1993.-   Madura K, Dohmen R J, Varshavsky A: N-recognin/Ubc2 interactions in    the N-end rule pathway. J Biol Chem 268(16): 12046-54, (Jun. 5)    1993.-   Marks J D, Griffiths Ad, Malmqvist M, Clackson T P, Bye J M, Winter    G: By-passing immunization: building high affinity human antibodies    by chain shuffling. Biotechnology (N Y) 10(7): 779-783, 1992.-   Marks J D, Hoogenboom H R, Bonnert T P, McCafferty J, Griffiths A D,    Winter G: By-passing immunization. Human antibodies from V-gene    libraries displayed on phage. J Mol Biol 222(3): 581-597, 1991.-   Marks J D, Hoogenboom H R, Griffiths A D, Winter G: Molecular    evolution of proteins on filamentous phage. Mimicking the strategy    of the immune system. J Biol Chem 267(23): 16007-16010, 1992.-   Maxam A M, Gilbert W: Sequencing end-labeled DNA with base-specific    chemical cleavages. Methods Enzymol 65(1): 499-560, 1980.-   McCafferty J, Griffiths A D, Winter G, Chiswell D J: Phage    antibodies: filamentous phage displaying antibody variable domains.    Nature 348(6301): 552-554, 1990. Method of DNA sequencing.-   Miller J H. A Short Course in Bacterial Genetics: A Laboratory    Manual and Handbook for Escherichia coli and Related Bacteria (see    inclusively p. 445). Cold Spring Harbor Laboratory Press, Plainview,    N.Y., © 1992.-   Milne G T and Weaver D T: Dominant negative alleles of RAD52 reveal    a DNA repair/recombination complex including Rad51 and Rad52. 
Genes    Dev 7(9): 1755-1765, 1993.-   Mullinax R L, Gross E A, Amberg J R, Hay B N, Hogrefe H H, Kubtiz M    M, Greener A, Alting-Mees M, Ardourel D, Short J M, et al:    Identification of human antibody fragment clones specific for    tetanus toxoid in a bacteriophage lambda immunoexpression library.    Proc Natl Acad Sci USA 87(20): 8095-9099, 1990.-   Nath K, Azzolina B A: in Gene Amplification and Analysis (ed.    Chirikjian J G), vol. 1, p. 113, Elsevier North Holland, Inc., New    York, N.Y., © 1981.-   Needleman S B and Wunsch C D: A general method applicable to the    search for similarities in the amino acid sequence of two proteins.    J Mol Biol 48(3): 443453, 1970.-   Nelson M, Christ C, Schildkraut 1: Alteration of apparent    restriction endonuclease recognition specificities by DNA    methylases. Nucleic Acids Res 12(13): 5165-73, 1984 (Jul. 11).-   Nicholls P J, Johnson V G, Andrew S M, Hoogenboom H R, Raus J C,    Youle R J: Characterization of single-chain antibody (sFv)-toxin    fusion proteins produced in vitro in rabbit reticulocyte lysate. J    Biol Chem 268(7): 5302-5308, 1993.-   Oiler A R, Vanden Broek W, Conrad M, Topal M D: Ability of DNA and    spermidine to affect the activity of restriction endonucleases from    several bacterial species. Biochemistry 30(9): 2543-9, (Mar. 5)    1991.-   Owen M R L, Pen J: Transgenic Plants: A Production System for    Industrial and Pharmaceutical Proteins. Chichester: John Wiley &    Sons, 1996.-   Owens R J and Young R J: The genetic engineering of monoclonal    antibodies. J Immunol Methods 168(2): 149-165, 1994.-   Pearson W R and Lipman D J: Improved tools for biological sequence    comparison. Proc Natl Acad Sci USA 85(8): 2444-2448, 1988.-   Pein C D, Reuter M, Meisel A, Cech D, Kruger D H: Activation of    restriction endonuclease EcoRII does not depend on the cleavage of    stimulator DNA. Nucleic Acids Res 19(19): 5139-42, (Oct. 
11) 1991.-   Persson M A, Caothien R H, Burton D R: Generation of diverse    high-affinity human monoclonal antibodies by repertoire cloning.    Proc Natl Acad Sci USA 88(6): 2432-2436, 1991.-   Perun T J, Propst C L, eds.: Computer-Aided Drug Design: Methods and    Applications. New York: Marcel Dekker, Inc., 1989.-   Qiang B Q, McClelland M, Poddar S, Spokauskas A, Nelson M: The    apparent specificity of NotI (5′-GCGGCCGC-3′) is enhanced by    M.FnuDII or M.BepI methyltransferases (5′-mCGCG-3′): cutting    bacterial chromosomes into a few large pieces. Gene 88(1): 101-5,    (Mar. 30) 1990.-   Queen C, Foster J, Stauber C, Stafford J: Cell-type specific    regulation of a kappa immunoglobulin gene by promoter and enhance    elements. Immunol Rev 89: 49-68, 1986.-   Raleigh E A, Wilson G: Escherichia coli K-12 restricts DNA    containing 5-methylcytosine. Proc Natl Acad Sci USA 83(23): 9070-4,    (December) 1986.-   Reidhaar-Olson J F and Sauer R T: Combinatorial cassette mutagenesis    as a probe of the informational content of protein sequences.    Science 241(4861): 53-57, 1988.-   Riechmann L and Weill M: Phage display and selection of a    site-directed randomized single-chain antibody Fv fragment for its    affinity improvement. Biochemistry 32(34): 8848-8855, 1993.-   Roberts R J, Macelis D: REBASE—restriction enzymes and methylases.    Nucleic Acids Res 24(1): 223-35, (Jan. 1) 1996.-   Ryan A J, Royal C L, Hutchinson J, Shaw C H: Genomic sequence of a    12S seed storage protein from oilseed rape (Brassica napus c.v. jet    neuf). Nucl Acids Res 17(9): 3584, 1989.-   Sambrook J. Fritsch E F, Maniatis T. Molecular Cloning: A Laboratory    Manual. Cold Spring Harbor Laboratory Press, Cold Spring Harbor,    N.Y., © 1982.-   Sambrook J. Fritsch E F, Maniatis T. Molecular Cloning: A Laboratory    Manual. Second Edition. Cold Spring Harbor Laboratory Press, Cold    Spring Harbor, N.Y., ©1989.-   Scopes R K. Protein Purification: Principles and Practice.    
Springer-Verlag, New York, N.Y., © 1982.-   Segel I H: Enzyme Kinetics: Behavior and Analysis of Rapid    Equilibrium and Steady-State Enzyme Systems. New York: John Wiley &    Sons, Inc., 1993.-   Silver S C and Hunt S W 3d: Techniques for cloning cDNAs encoding    interactive transcriptional regulatory proteins. Mol Biol Rep 17(3):    155-165, 1993.-   Smith T F, Waterman M S, Fitch W M: Comparative biosequence metrics.    J Mol Evol S18(1): 3846, 1981.-   Smith T F, Waterman M S. Adv Appl Math 2: 482-end of article, 1981.-   Smith T F, Waterman M S: Identification of common molecular    subsequences. J Mol Biol 147(1): 195-7, (Mar. 25) 1981.-   Smith T F, Waterman M S: Overlapping genes and information theory. J    Theor Biol 91(2): 379-80, (Jul. 21) 1981.-   Staudinger J, Perry M, Elledge S J, Olson E N: Interactions among    vertebrate helix-loop-helix proteins in yeast using the two-hybrid    system. J Biol Chem 268(7): 4608-4611, 1993.-   Stemmer W P, Morris S K, Wilson B S: Selection of an active single    chain Fv antibody from a protein linker library prepared by    enzymatic inverse PCR. Biotechniques 14(2): 256-265, 1993.-   Stemmer W P: DNA shuffling by random fragmentation and reassembly:    in vitro recombination for molecular evolution. Proc Natl Acad Sci    USA 91(22): 10747-10751, 1994.-   Sun D, Hurley L H: Effect of the (+)-CC-1065-(N-3-adenine) DNA    adduct on in vitro DNA synthesis mediated by Escherichia coli DNA    polymerase. Biochemistry 31: 10, 2822-9, (Mar. 17) 1992,-   Tague B W, Dickinson C D, Chrispeels M J: A short domain of the    plant vacuolar protein phytohemagglutinin targets invertase to the    yeast vacuole. Plant Cell 2(6): 533-46, (June) 1990.-   Takahashi N, Kobayashi 1: Evidence for the double-strand break    repair model of bacteriophage lambda recombination. 
Proc Natl Acad    Sci USA 87(7): 27904, (April) 1990.-   Thiesen H J and Bach C: Target Detection Assay (TDA): a versatile    procedure to determine DNA binding sites as demonstrated on SPI    protein. Nucleic Acids Res 18(11): 3203-3209, 1990.-   Thomas M, Davis R W: Studies on the cleavage of bacteriophage lambda    DNA with EcoRI Restriction endonuclease. J Mol Biol 91(3): 315-28,    (Jan. 25) 1975.-   Tingey S V, Walker E L, Corruzzi G M: Glutamine synthetase genes of    pea encode distinct polypeptides which are differentially expressed    in leaves, roots and nodules. EMBO J. 6(1): 1-9, 1987.-   Topal M D, Thresher R J, Conrad M, Griffith J: Nael endonuclease    binding to pBR322 DNA induces looping. Biochemistry 30(7): 2006-10,    (Feb. 19) 1991.-   Tramontano A, Chothia C, Lesk A M: Framework residue 71 is a major    determinant of the position and conformation of the second    hypervariable region in the V_(H) domains of immunoglobulins. J Mol    Biol 215(1): 175-182, 1990.-   Tuerk C and Gold L: Systematic evolution of ligands by exponential    enrichment: RNA ligands to bacteriophage T4 DNA polymerase. Science    249(4968): 505-510, 1990.-   U.S. Pat. No. 4,683,195; Filed Feb. 7, 1986, Issued Jul. 28, 1987.    Mullis K B, Erlich H A, Arnheim N, Horn G T, Saiki R K, Scharf S J:    Process for Amplifying, Detecting, and/or Cloning Nucleic Acid    Sequences.-   U.S. Pat. No. 4,683,202; Filed Oct. 25, 1985, Issued Jul. 28, 1987.    Mullis K B: Process for Amplifying Nucleic Acid Sequences.-   U.S. Pat. No. 4,704,362; Filed Nov. 5, 1979, Issued Nov. 3, 1987.    Itakura K, Riggs A D: Recombinant Cloning Vehicle Microbial    Polypeptide Expression.-   U.S. Pat. No. 4,713,337; Filed Jan. 3, 1985, Issued Dec. 15, 1987.    Jasin M, Schimmel P R: Method for deletion of a gene from a    bacteria.-   U.S. Pat. No. 4,732,856; Filed Apr. 3, 1984, Issued Mar. 22, 1988.    Federoff N V: Transposable elements and process for using same.-   U.S. Pat. No. 
4,963,487; Filed Sep. 14, 1987, Issued Jan. 16, 1990.    Schimmel P R: Method for deletion of a gene from a bacteria.-   U.S. Pat. No. 5,354,656; Filed Oct. 2, 1989, Issued Oct. 11, 1994.    Sorge, Joseph A.; Huse, William D.:-   U.S. Pat. No. 5,385,835; Filed May 19, 1994, Issued Jan. 31, 1995.    Helentjaris, Timothy; Nienhuis, James: Identification and    localization and introgression into plants of desired multigenic    traits.-   U.S. Pat. No. 5,453,247; Filed Nov. 23, 1993, Issued Sep. 26, 1995.    Beavis, Ronald C.; Chait, Brian T.: Instrument and method for the    sequencing of genome.-   U.S. Pat. No. 5,604,100; Filed Jul. 19, 1995, Issued Feb. 18, 1997.    Perlin, Mark W.: Method and system for sequencing genomes.-   U.S. Pat. No. 5,670,321; Filed May 10, 1995, Issued Sep. 23, 1997.    Kimmel, Bruce E.; Ellis, Michael; Ruddy, David: Efficient method to    conduct large-scale genome sequencing.-   U.S. Pat. No. 5,925,808; Filed Dec. 19, 1997, Issued Jul. 20, 1999.    Oliver, Melvin John; Quisenberry, Jerry Edwin; Trolinder, Norma Lee    Glover; Keim, Don Lee: Control Of Plant Gene Expression.-   U.S. Pat. No. 5,953,727; Filed Mar. 6, 1997, Issued Sep. 14, 1999.    Maslyn, Timothy J.; Au-Young, Janice; Hillman, Jennifer L.; Hibbert,    Harold; Akerblom, Ingrid E.; Cheng, Rachel J.; Tang, Yuanhua T.:    Project-based full-length biomolecular sequence database.-   U.S. Pat. No. 5,965,443; Filed Sep. 9, 1996, Issued Oct. 12, 1999.    Reznikoff W S, Goryshin I Y: System for in vitro transposition.-   U.S. Pat. No. 5,981,177; Filed Jan. 25, 1995, Issued Nov. 9, 1999.    Demirjian D C, Casadaban M J, Weber M, Gaines G L: Protein fusion    method and constructs.-   U.S. Pat. No. 5,994,058; Filed Mar. 20, 1995, Issued Nov. 30, 1999.    Senaphthy, Periannan: Method For Contiguous Genome Sequencing.-   U.S. Pat. No. 6,023,659; Filed Mar. 6, 1997, Issued Feb. 8, 2000.    
Seilhamer, Jeffrey J.; Akerblom, Ingrid E.; Altus, Christina M.;    Klingler, Tod M.; Russo, Frank; Au-Young, Janice; Hillman, Jennifer    L.; Maslyn, Timothy J.: Database System Employing Protein Function    Hierarchies For Viewing Biomolecular Sequence Data.-   van de Poll M L, Lafleur M V, van Gog F, Vrieling H, Meerman J H:    N-acetylated and deacetylated 4′-fluoro-4-aminobiphenyl and    4-aminobiphenyl adducts differ in their ability to inhibit DNA    replication of single-stranded M 13 in vitro and of single-stranded    phi X174 in Escherichia coli. Carcinogenesis 13(5): 751-8, (May)    1992.-   Vojtek A B, Hollenberg S M, Cooper J A: Mammalian Ras interacts    directly with the serine/threonine kinase Raf. Cell 74(1): 205-214,    1993.-   Wenzier H, Mignery G, Fisher L, Park W: Sucrose-regulated expression    of a chimeric potato tuber gene in leaves of transgenic tobacco    plants. Plant Mol Biol 13(4): 347-54, 1989.-   White J S, White D C: Source Book of Enzymes. Boca Raton: CRC Press,    1997.-   Williams and Barclay, in Immunoglobulin Genes, The Immunoglobulin    Gene Superfamily-   Winnacker E L. From Genes to Clones: Introduction to Gene    Technology. VCH Publishers, New York, N.Y., ©) 1987.-   Winter G and Milstein C: Man-made antibodies. Nature 349(6307):    293-299, 1991.-   WO 00/04190; Filed Jul. 15, 1999, Published Jan. 27, 2000. Del    Cardayre S, Tobin M, Stemmer W P, Ness J E, Minshull J, Patten P A,    Subramanian V, Castle L A, Krebber C M, Bass S, Zhang Y, Cox T,    Huisman G, Yuan L, Affholter J A: Evolution of whole cells and    organisms by recursive sequence recombination.-   WO 00/09755; Filed Aug. 12, 1999, Published Feb. 24, 2000. Zarling    D, Reddy G, Pati S: Domain specific gene evolution.-   WO 88/08453; Filed Apr. 14, 1988, Published Nov. 3, 1988. Alakhov J    B, Baranov, VI, Ovodov S J, Ryabova L A, Spirin A S: Method of    Obtaining Polypeptides in Cell-Free Translation System.-   WO 90/05785; Filed Nov. 
15, 1989, Published May 31, 1990. Schultz P:    Method for Site-Specifically Incorporating Unnatural Amino Acids    into Proteins.-   WO 90/07003; Filed Jan. 27, 1989, Published Jun. 28, 1990. Baranov V    I, Morozov I J, Spirin A S: Method for Preparative Expression of    Genes in a Cell-free System of Conjugated Transcription/translation.-   WO 91/02076; Filed Jun. 14, 1990, Published Feb. 21, 1991. Baranov V    I, Ryabova L A, Yarchuk O B, Spirin A S: Method for Obtaining    Polypeptides in a Cell-free System.-   WO 91/05058; Filed Oct. 5, 1989, Published Apr. 18, 1991. Kawasaki    G: Cell-free Synthesis and Isolation of Novel Genes and    Polypeptides.-   WO 91/17271; Filed May 1, 1990, Published Nov. 14, 1991. Dower W J,    Cwirla S E: Recombinant Library Screening Methods.-   WO 91/18980; Filed May 13, 1991, Published Dec. 12, 1991. Devlin J    J: Compositions and Methods for Indentifying Biologically Active    Molecules.-   WO 91/19818; Filed Jun. 20, 1990, Published Dec. 26, 1991. Dower W    J, Cwirla S E, Barrett R W: Peptide Library and Screening Systems.-   WO 92/02536; Filed Aug. 1, 1991, Published Feb. 20, 1992. Gold L,    Tuerk C: Systematic Polypeptide Evolution by Reverse Translation.-   WO 92/03918; Filed Aug. 28, 1991, Published Mar. 19, 1992. Lonberg    N, Kay R M: Transgenic Non-human Animals Capable of Producing    Heterologous Antibodies.-   WO92/05258; Filed Sep. 17, 1991, Published Apr. 2, 1992. Fincher G    B: Gene Encoding Barley Enzyme.-   WO 92/14843; Filed Feb. 21, 1992, Published Sep. 3, 1992. Toole J J,    Griffin L C, Bock L C, Latham J A, Muenchau D D, Krawczyk S:    Aptamers Specific for Biomolecules and Method of Making.-   WO 93/08278; Filed Oct. 15, 1992, Published Apr. 29, 1993. Schatz P    J, Cull M G, Miller J F, Stemmer W P: Peptide Library and Screening    Method.-   WO 93/12227; Filed Dec. 17, 1992, Published Jun. 24, 1993. 
Lonberg    N, Kay R M: Transgenic Non-human Animals Capable of Producing    Heterologous Antibodies.-   WO 94/25585; Filed Apr. 25, 1994, Published Nov. 10, 1994. Lonberg    N, Kay R M: Transgenic Non-human Animals Capable of Producing    Heterologous Antibodies.-   WO 95/00530; Filed Jun. 6, 1994, Published Jan. 1, 1995. Fodor,    Stephen, P., A.; Lipshutz, Robert, J.; Huang, Xiaohua; Jevons, Luis,    Carlos: Hybridization and Sequencing of Nucleic Acids.-   WO 96/21031; Filed Jun. 7, 1995, Published Jul. 11, 1996. Tricoli,    David, M.; Carney, Kim, J.; Russell, Paul, F.; Quemada, Hector, D.;    Mcmaster, J., Russell; Reynolds, John, F.; Deng, Rosaline, Z.:    Transgenic Plants Expressing DNA Constructs Containing A Plurality    Of Genes To Impart Virus Resistance.-   WO 96/27025; Filed Feb. 21, 1996, Published Sep. 6, 1996. Rabani,    Ely, Michael: Device, Compounds, Algorithms, And Methods Of    Molecular Characterization And Manipulation With Molecular    Parallelism.-   WO 97/17429; Filed Nov. 8, 1996, Published May 15, 1997.    Oglevee-O'donovan, Wendy; Arteca, Richard, N.; Arteca, Jeannette;    Stoots, Eleanor: Method For The Commercial Production Of Transgenic    Plants.-   WO 97/35966; Filed Mar. 20, 1997, Published Oct. 2, 1997. Minshull    J, Stemmer W P: Methods and compositions for cellular and metabolic    engineering.-   WO 97/37041; Filed Mar. 18, 1997, Published Oct. 9, 1997. Köster,    Hubert: DNA Sequencing By Mass Spectrometry.-   WO 97/42348; Filed May 5, 1997, Published Nov. 13, 1997. Köster,    Hubert; Van Den Boom, Dirk; Ruppert, Andreas: Process For Direct    Sequencing During Template Amplification.-   WO 98/26407; Filed Dec. 11, 1997, Published Jun. 18, 1998. Sabatini,    Cathryn, E.; Heath, Joe, Don; Covitz, Peter, A.; Klinger, Tod, M.;    Russo, Frank, D.; Berry, Stephanie, F.: Database And System For    Storing, Comparing And Displaying Genomic Information.-   WO 98/26408; Filed Dec. 11, 1997, Published Jun. 18, 1998. 
Sabatini,    Cathryn, E.; Heath, Joe, Don; Covitz, Peter, A.; Klingler, Tod, M.;    Russo, Frank, D.; Berry, Stephanie, F.: Database And System For    Determining, Storing And Displaying Gene Locus Information.-   WO 98/31833; Filed Dec. 12, 1997, Published Jul. 23, 1998. Ju,    Jingyue: Nucleic Acid Sequencing With Solid Phase Capturable    Terminators.-   WO 98/31834; Filed Dec. 12, 1997, Published Jul. 23, 1998. Ju,    Jingyue: Sets Of Labeled Energy Transfer Fluorescent Primers And    Their Use In Multi Component Analysis.-   WO 98/31837; Filed Jan. 16, 1998, Published Jul. 23, 1998.    Delcardayre S B, Tobin M B, Stemmer W P, Ness J E, Minshull J,    Patten P: Evolution of whole cells and organisms by recursive    sequence recombination.-   WO 98/36085; Filed Feb. 13, 1998, Published Aug. 20, 1998. Sutliff,    Thomas, D.; Rodriguez, Raymond, L.: Production Of Mature Proteins In    Plants.-   WO 98/37223; Filed Feb. 18, 1998, Published Aug. 27, 1998. Pang,    Sheng-Zhi; Gonsalves, Dennis; Jan, Fuh-Jyh: DNA Construct To Confer    Multiple Traits On Plants.-   WO 99/35494; Filed Jan. 8, 1999, Published Jul. 15, 1999. Tally F P,    Tao J, Wendler P A, Connelly G, Gallant P L: Method for identifying    validated target and assay combinations.-   WO 99/37755; Filed Dec. 11, 1998, Published Jul. 29, 1999. Pati S,    Zarling David, Lehman C W, Zeng H: The use of consensus sequences    for targeted homologous gene isolation and recombination in gene    families.-   WO 99/49403; Filed Mar. 25, 1999, Published Sep. 30, 1999. Lincoln,    Stephen, E.-   Hodgson, David, M.; Spiro, Peter, A.; Russo, Frank, D.; Akerblom,    Ingrid, E.; Hillman, Jennifer, L.; Jones, Anissa, Lee; Bratcher,    Shawn, Robert; Cohen, Howard, Jerome; Dufour, Gerard; Wood, Michael,    Peter; Koleszar, Alexander, George; Banville, Steven, C.: System And    Methods For Analyzing Biomolecular Sequences.-   WO95/11995; Filed Oct. 26, 1994, Published May 4, 1995. 
Chee M,    Cronin M T, Fodor S P, Gingeras T R, Huang X C, Hubbell E A,    Lipshutz R J, Lobban P E, Miyada C G, Morris M S, Shah N, Sheldon E    L: Arrays Of Nucleic Acid Probes On Biological Chips.-   Wong C H, Whitesides G M: Enzymes in Synthetic Organic Chemistry.    Vol. 12. New York: Elsevier Science Publications, 1995.-   Yang X, Hubbard E J, Carlson M: A protein kinase substrate    identified by the two-hybrid system. Science 257(5070): 680-2,    (Jul. 31) 1992.-   Gygi S P, Rist B, Gerber S A, Turecek F, Gelb M H, Aebersold R.:    Quantitative analysis of complex protein mixtures using    isotope-coded affinity tags. Nat Biotechnol 17(10): 994-9 (October)    1999.-   Hopkins M J, Sharp R, Macfarlane G T.: Age and disease related    changes in intestinal bacterial populations assessed by cell    culture, 16S rRNA abundance, and community cellular fatty acid    profiles. Gut 48(2): 198-205 (February) 2001.-   Ritchie N J, Schutter M E, Dick R P, Myrold D D.: Use of length    heterogeneity PCR and fatty acid methyl ester profiles to    characterize microbial communities in soil. Appl Environ Microbiol    66(4): 1668-75 (April) 2000.-   Khan A A, Wang R F, Cao W W, Franklin W, Cemiglia C E.:    Reclassification of a polycyclic aromatic hydrocarbon-metabolizing    bacterium, Beijerinckia sp. strain B 1, as Sphingomonas yanoikuyae    by fatty acid analysis, protein pattern analysis, DNA-DNA    hybridization, and 16S ribosomal DNA sequencing. Int J Syst    Bacteriol 46(2): 466-9 (April) 1996.-   Peltroche-Llacsahuanga H, Schmidt S, Lutticken R, Haase G.:    Discriminative power of fatty acid methyl ester (FAME) analysis    using the microbial identification system (MIS) for Candida    (Torulopsis) glabrata and Saccharomyces cerevisiae. Diagn Microbiol    Infect Dis 38(4): 213-21 (December) 2000.-   S A Gerber et al.: Analysis of rates of multiple enzymes in cell    lysates by electrospray ionization mass spectrometry. J. Am. Chem.    Soc. 121: 1102-3 1999.    
www.genomeweb.com David Goodlett discusses the latest in genomics —ICAT reagents. Written by: Marian Moser Jones, Dec. 20, 2000.
-   WO0011208; Filed Aug. 25, 1999, Published Mar. 2, 2000. Aebersold R H, Gelb M H, Gygi S P, Scott C R, Turecek F, Gerber S A, Rist B: Rapid quantitative analysis of proteins or protein function in complex mixtures.
-   WO9905221; Filed Jul. 27, 1998, Published Feb. 4, 1999. Cummins W J, West R M, Smith J A: Cyanine Dyes.
-   U.S. Pat. No. 4,876,350; Filed Dec. 16, 1987, Issued Oct. 24, 1989. McGarrity J, Tenud L: Process for the production of (+) biotin.
-   U.S. Pat. No. 5,776,723; Filed Feb. 8, 1996, Issued Jul. 7, 1998. Herold C D, O'Hagan M: Rapid detection of mycobacterium tuberculosis.
-   U.S. Pat. No. 6,136,173; Filed Jun. 24, 1996, Issued Oct. 24, 2000. Anderson N L, Anderson N G, Goodman J: Automated system for two-dimensional electrophoresis.
-   U.S. Pat. No. 6,127,134; Filed Apr. 20, 1995, Issued Oct. 3, 2000. Minden J, Waggoner A: Difference gel electrophoresis using matched multiple dyes.
-   U.S. Pat. No. 6,064,754; Filed Dec. 1, 1997, Issued May 16, 2000. Parekh R B, Amess R, Bruce J A, Prime S B, Platt A E, Stoney R M: Computer-assisted methods and apparatus for identification and characterization of biomolecules in a biological sample.
-   U.S. Pat. No. 6,013,165; Filed May 22, 1998, Issued Jan. 11, 2000. Wiktorowicz J E, Raysberg Y: Electrophoresis apparatus and method.
-   Ausubel F M, Brent R, Kingston R E, Moore D D, Seidman J G, Smith J A, Struhl K Editors. Current Protocols In Molecular Biology, Vol 2. John Wiley & Sons, Inc, © 2001, 10.21.4-10.21.6, 10.22.5-10.22.10, 10.22.14, 10.22.15-10.22.20.
-   Sambrook J, Russell D W Editors. Molecular Cloning A Laboratory Manual 3^(rd) ed. Cold Spring Harbor Laboratory Press, New York, © 2001, 18.3, 18.62, 18.66.

1.4.8. Additional Methods for Differential Analysis

1.4.8.1. Protein Expression Profiling Using Selective Differential Labeling

The use of mass spectrometry to identify proteins whose sequences are present in either DNA or protein databases is well established and integral to the field of Proteomics. Protein and peptide masses can be determined with high accuracy by several mass spectrometric techniques. Peptides can be further fragmented in a tandem or ion trap mass spectrometer, yielding sequence information for the peptide. Both types of mass information can be used to identify a protein in a sequence database. One goal of Proteomics is to define the expressed proteins associated with a given cellular state, and another is to quantify changes in protein expression between cellular states. One of the new methodologies that has had a great impact on proteome research is known as isotope-coded affinity tag (ICAT) peptide labeling (17). The method is based on a newly synthesized class of chemical reagents (ICATs) used in combination with tandem mass spectrometry. The ICAT reagent contains a biotin affinity tag and a thiol-specific reactive group, joined by a spacer domain that is available in two forms: regular and isotopically heavy, the latter including eight deuterium atoms. First, a reduced protein mixture representing one cell state is derivatized with the isotopically light version of the ICAT reagent, while the corresponding reduced protein mixture representing a second cell state is derivatized with the isotopically heavy version of the ICAT reagent. Second, the labeled samples are combined and proteolytically digested to produce peptide fragments. Third, the tagged cysteine-containing peptide fragments are isolated by avidin affinity chromatography. Finally, the isolated tagged peptides are separated and analyzed by microcapillary tandem mass spectrometry.
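
By way of illustration only, the pairing of light and heavy ICAT signals described above can be sketched in Python. The sketch assumes that each tagged cysteine contributes a mass difference equal to eight hydrogen-to-deuterium substitutions (about 8.05 Da) and that the peak list is a simple list of (m/z, intensity) pairs. The function names, the input format, and the tolerance are hypothetical and are not taken from any instrument software or from the cited ICAT reference.

```python
# Mass gained when one hydrogen is replaced by deuterium (Da).
DEUTERIUM_SHIFT = 2.014102 - 1.007825
# The heavy reagent described above carries eight deuterium atoms.
HEAVY_LIGHT_DELTA = 8 * DEUTERIUM_SHIFT  # about 8.05 Da per tagged cysteine


def expected_spacing(n_cysteines, charge):
    """m/z gap between the light and heavy forms of the same peptide."""
    return n_cysteines * HEAVY_LIGHT_DELTA / charge


def find_pairs(peaks, n_cysteines=1, charge=1, tol=0.02):
    """Pair peaks whose m/z values differ by the expected spacing.

    peaks: list of (mz, intensity) tuples (hypothetical input format).
    Returns a list of (light_mz, heavy_mz, heavy_to_light_ratio).
    """
    gap = expected_spacing(n_cysteines, charge)
    ordered = sorted(peaks)
    pairs = []
    for light_mz, light_int in ordered:
        for heavy_mz, heavy_int in ordered:
            matches_gap = abs((heavy_mz - light_mz) - gap) <= tol
            if matches_gap and light_int > 0:
                pairs.append((light_mz, heavy_mz, heavy_int / light_int))
    return pairs


# Illustrative values only: one light/heavy pair plus an unrelated peak.
example_peaks = [(1000.50, 200.0), (1008.55, 400.0), (1200.00, 50.0)]
example_pairs = find_pairs(example_peaks)
```

In this sketch the heavy-to-light intensity ratio of a matched pair stands in for the relative abundance of the parent protein in the two cell states.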

There are, however, limitations associated with this approach: (i) the differential labeling reagents rely on stable isotopes, which are expensive and not readily adapted to multiplex differential labeling; (ii) the moieties attached to the original peptides have a mass of approximately 500 Daltons, which is heavier than some peptides and is likely to affect the peptide ionization and fragmentation process; (iii) some bonds in the labeling reagent are weak compared to the amide bond, which might complicate the MS/MS spectrum; (iv) protein expression profiling is limited to duplex comparison; and (v) the affinity interaction between biotin and avidin is too strong to release the immobilized peptide efficiently.

In one embodiment, the present invention provides a method for simultaneous identification and quantification of expression levels of individual proteins carrying certain functional groups in their side chains. The proteins may be analyzed in complex mixtures. The method is based on comparison of two or more samples of proteins, one of which can be considered as the standard sample and all others can be considered as samples under investigation.

The samples of proteins are subjected to a sequence of manipulations including (i) proteolytic digestion into mixtures of peptides, (ii) treatment of the mixtures of peptides with chemical probes, (iii) washing away and discarding the unbound peptides from the mixtures, and (iv) cleaving the chemical probes, with the consequent release into solution of the peptides still carrying parts of the chemical probes. This sequence of manipulations may also include one or more auxiliary chemical and/or enzymatic modifications of functional groups in side chains and/or in the free termini of the proteins and/or peptides in order to achieve selective and the most favorable modification for the next steps in the protocol. The auxiliary modifications may be performed between any steps of the main sequence.

The core structure of the chemical probe consists of (i) a solid support, (ii) a spacer, (iii) a cleavable moiety, (iv) a differential mass labeling unit, and (v) a reactive group. The chemical probes perform three functions: (i) they attach peptides carrying specific functional groups in their side chains and/or termini to a solid support by forming covalent chemical bonds to the reactive group of the probe, (ii) they provide means for selective cleavage of the attached peptide from the solid support such that a part of the probe still remains attached to the peptide, and (iii) they serve as differential labeling reagents.

Differential labeling results from attaching chemical moieties of different mass but of similar properties to a protein or a peptide such that peptides with the same sequence but with different labels are eluted together in the separation procedure and their ionization and detection properties in mass spectrometric analysis are very similar. The differential mass labeling unit remains covalently bound to the peptide after it is cleaved from the solid support part of the probe. Signals corresponding to peptides with the same sequence but marked with differential mass labels are assigned to different original protein samples.
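
As a non-limiting illustration, the assignment of differentially labeled signals to their original protein samples can be sketched in Python. The sketch assumes that each sample was labeled with a unit of known, distinct residual mass and that the unlabeled peptide mass is known. The sample names, label masses, and tolerance below are purely illustrative assumptions and do not correspond to any particular labeling unit disclosed herein.

```python
def assign_signals(peptide_mass, label_masses, observed, tol=0.01):
    """Map observed (mass, intensity) signals to their sample of origin.

    peptide_mass: mass of the unlabeled peptide (Da).
    label_masses: dict of sample name -> mass of the label part that
        remains on the peptide after cleavage (hypothetical values).
    observed: list of (mass, intensity) tuples.
    Returns a dict of sample name -> summed intensity.
    """
    assigned = {}
    for mass, intensity in observed:
        for sample, label_mass in label_masses.items():
            if abs(mass - (peptide_mass + label_mass)) <= tol:
                assigned[sample] = assigned.get(sample, 0.0) + intensity
    return assigned


def relative_to_standard(assigned, standard):
    """Express each sample's signal relative to the standard sample."""
    base = assigned[standard]
    return {s: v / base for s, v in assigned.items() if s != standard}


# Illustrative values only: three samples labeled with units 4 Da apart.
labels = {"standard": 100.0, "sample_A": 104.0, "sample_B": 108.0}
signals = [(1600.0, 10.0), (1604.0, 20.0), (1608.0, 5.0), (1700.0, 99.0)]
by_sample = assign_signals(1500.0, labels, signals)
ratios = relative_to_standard(by_sample, "standard")
```

Because the labels differ only in mass, the same routine extends from a duplex comparison to any number of samples by adding entries to the label table.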

The auxiliary chemical and/or enzymatic modification can be used to introduce additional differential mass labels into the peptides.

The reactive group on the chemical probe may be activated or modified by a bridging reagent prior to a reaction with mixtures of peptides. Such activation or modification provides for greater flexibility in the design of the chemical probe, since the same core structure of a chemical probe may be tuned to increase reactivity and/or selectivity towards different functional groups in side chains and/or in termini of the peptides.

After being cleaved from the solid support part of the chemical probe, the differentially labeled peptide mixtures are combined, subjected to multidimensional chromatographic separation, and analyzed by mass spectrometry methods. Mass spectrometry data are processed by special software, which determines the composition and sequence of the peptides in the mixture and traces them back to identify and quantify the original proteins.

This approach can be used for duplex or potentially multiplex protein expression profiling. The complexity of the sample is reduced by targeting peptides containing particular amino acids, which are selected by a reaction with chemical probes.
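The reduction in sample complexity can be illustrated with a minimal in-silico sketch. It assumes a simplified trypsin rule (cleavage after K or R, but not before P) and a probe that captures only cysteine-containing peptides; the function names and the example sequence are hypothetical and are not part of the disclosed method.

```python
import re

def tryptic_digest(protein: str) -> list[str]:
    """Simplified trypsin rule (an assumption): cleave after K or R unless followed by P."""
    # Zero-width split points require Python 3.7 or later.
    return [p for p in re.split(r"(?<=[KR])(?!P)", protein) if p]

def select_peptides(peptides: list[str], target_residue: str = "C") -> list[str]:
    """Keep only peptides containing the residue targeted by the probe."""
    return [p for p in peptides if target_residue in p]

# Hypothetical sequence used only for illustration.
protein = "AAKCCRPGKDDR"
peptides = tryptic_digest(protein)           # all proteolytic peptides
captured = select_peptides(peptides, "C")    # peptides retained on the solid support
print(len(peptides), "peptides in digest;", len(captured), "captured:", captured)
```

Only the captured subset would proceed to labeling and analysis, which is the sense in which the mixture is simplified.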

Novelties of this invention include: (i) design of solid phase-based differential mass labeling reagents for selective peptide modification; (ii) design of various kinds of differential mass units; (iii) combination of differential mass probes with various bridging reagents to target certain amino acids specifically; (iv) multiplex analysis; (v) combination of proteolytic digestion and chemical and/or enzymatic modifications in side chains and/or in termini of proteins and peptides in order to achieve selective and the most favorable modifications for the next steps in the protocol; (vi) combination of differential chemical labeling with MudPIT and, if necessary, with other protein/peptide separation or purification technologies.

One embodiment of this invention provides reagents and procedures for quantification of protein expression using a combination of selective differential peptide labeling and LC MS/MS or LC-LC MS/MS. This invention overcomes the limitations inherent in traditional techniques. The basic approach described can be employed for quantitative analysis of protein expression in complex samples (such as cells, tissues, and fractions), the detection and quantitation of specific proteins in complex samples, and quantitative measurement of specific enzymatic activities in complex samples.

1.4.8.2. Technical Description

1. Probe Design:

The solid support part of the chemical probe may consist of any of the following materials or any combination of them: gel, glass beads, magnetic beads, polymers, silicon wafer, membrane, or resin.

The spacer between the solid phase part and the cleavable unit of the chemical probe may be included for convenience and improved yields in synthetic preparation of the chemical probe. The spacer may consist of a chain of 2 to 8 atoms, which can be C, O, N, B, Si, S, P, Se . . . , covalently bound to each other. In order to satisfy the valence requirements, the atoms may carry hydrogen atoms, halogens, or one of the following groups containing up to 25 atoms: alkyl, hydroxy, alkoxy, amino, alkylamino . . . The spacer may contain cyclic moieties with or without heteroatoms and with or without substituents.

The cleavable moiety provides means for selective detachment of the solid phase part of the chemical probe from the differential mass label attached to the peptide. It is designed such that it can be cleaved by treating the probe with a chemical reagent or with any kind of electromagnetic irradiation, photochemically, enzymatically, or thermally.

Differential mass labeling units differ in molecular mass, but do not differ in retention properties with respect to the separation method used or in ionization and detection properties with respect to the mass spectrometry methods used. These moieties differ either in their isotope composition (isotopic labels) or structurally by a rather small fragment, a change that does not alter the properties stated above (homologous labels).

The isotopic labels can be presented by general formulae:

-   Z^(A) and Z^(B), where Z^(A) and Z^(B) = R-Z¹-A¹-Z²-A²-Z³-A³-Z⁴-A⁴-

-   Z¹, Z², Z³, and Z⁴ independently of one another can be selected from O, OC(O), OC(S), OC(O)O, OC(O)NR, OC(S)NR, OSiRR¹, S, SC(O), SC(S), SS, S(O), S(O₂), NR, NRR¹⁺, C(O), C(O)O, C(S), C(S)O, C(O)S, C(O)NR, C(S)NR, SiRR¹, (Si(RR¹)O)_(n), SnRR¹, Sn(RR¹)O, BR(OR¹), BRR¹, B(OR)(OR¹), OBR(OR¹), OBRR¹, OB(OR)(OR¹), or Z¹-Z⁴ may be absent;

-   A¹, A², A³, and A⁴ independently of one another can be selected from (CRR¹)_(n), in which some single C-C bonds may be replaced with double or triple bonds, in which case some groups R and R¹ will be absent, o-arylene, m-arylene, p-arylene with up to 6 substituents, carbocyclic, bicyclic, or tricyclic fragments with up to 8 atoms in the cycle with or without heteroatoms (O, N, S) and with or without substituents, or A¹-A⁴ may be absent;

-   R, R¹ independently from other R and R¹ in Z¹-Z⁴ and independently    from other R and R¹ in A¹-A⁴ is hydrogen, halogen, an alkyl,    alkenyl, alkynyl, or aryl group; n in Z¹-Z⁴ is independent of n in    A¹-A⁴ and is a whole number that can have value from 0 to 21.

Z^(A) has the same structure as Z^(B), but they have different isotope composition. For instance, if Z^(A) contains x number of protons, Z^(B) may contain y number of deuterons in the place of protons, and, correspondingly, x−y number of protons remaining; and/or if Z^(A) contains x number of borons-10, Z^(B) may contain y number of borons-11 in the place of borons-10, and, correspondingly, x−y number of borons-10 remaining; and/or if Z^(A) contains x number of carbons-12, Z^(B) may contain y number of carbons-13 in the place of carbons-12, and, correspondingly, x−y number of carbons-12 remaining; and/or if Z^(A) contains x number of nitrogens-14, Z^(B) may contain y number of nitrogens-15 in the place of nitrogens-14, and, correspondingly, x−y number of nitrogens-14 remaining; and/or if Z^(A) contains x number of sulfurs-32, Z^(B) may contain y number of sulfurs-34 in the place of sulfurs-32, and, correspondingly, x−y number of sulfurs-32 remaining; and so on for all elements which may be present and have different stable isotopes.

x and y are whole numbers between 1 and 21 such that x is greater than y.

An example of an isotopic label pair/series: (CD₂)_(n)/(CH₂)_(n), where n=0, 1, 2, . . . , 21; (delta mass=2n).
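As a hedged illustration of the mass differential encoded by an isotopic label pair, the sketch below sums the per-atom mass differences for a given set of substitutions. The isotope masses are standard monoisotopic values rounded to six decimals; the dictionary keys and function names are assumptions introduced only for this example.

```python
# Heavy-minus-light mass differences in Da (standard isotope masses, rounded).
ISOTOPE_SHIFT = {
    "H->D": 2.014102 - 1.007825,
    "10B->11B": 11.009305 - 10.012937,
    "12C->13C": 13.003355 - 12.000000,
    "14N->15N": 15.000109 - 14.003074,
    "32S->34S": 33.967867 - 31.972071,
}

def label_mass_shift(substitutions: dict[str, int]) -> float:
    """Mass difference between Z^(B) and Z^(A) for y substitutions of each kind."""
    return sum(ISOTOPE_SHIFT[kind] * count for kind, count in substitutions.items())

def cd2_series_shift(n: int) -> float:
    """Shift for the (CD2)n/(CH2)n series: 2n protons replaced by deuterons."""
    return label_mass_shift({"H->D": 2 * n})

for n in (1, 4, 8):
    print(n, round(cd2_series_shift(n), 4))
```

The exact shift is slightly larger than the nominal value of 2n because a deuteron exceeds a proton by about 1.0063 Da rather than exactly 1 Da.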

The homologous reagents can be presented by general formulae:

-   Z^(A) and Z^(B), where Z^(A) and Z^(B) = R-Z¹-A¹-Z²-A²-Z³-A³-Z⁴-A⁴-

-   Z¹, Z², Z³, and Z⁴ independently of one another can be selected from O, OC(O), OC(S), OC(O)O, OC(O)NR, OC(S)NR, OSiRR¹, S, SC(O), SC(S), SS, S(O), S(O₂), NR, NRR¹⁺, C(O), C(O)O, C(S), C(S)O, C(O)S, C(O)NR, C(S)NR, SiRR¹, (Si(RR¹)O)_(n), SnRR¹, Sn(RR¹)O, BR(OR¹), BRR¹, B(OR)(OR¹), OBR(OR¹), OBRR¹, OB(OR)(OR¹), or Z¹-Z⁴ may be absent;

-   A¹, A², A³, and A⁴ independently of one another can be selected from (CRR¹)_(n), in which some single C-C bonds may be replaced with double or triple bonds, in which case some groups R and R¹ will be absent, o-arylene, m-arylene, p-arylene with up to 6 substituents, carbocyclic, bicyclic, or tricyclic fragments with up to 8 atoms in the cycle with or without heteroatoms (O, N, S) and with or without substituents, or A¹-A⁴ may be absent;

-   R, R¹ independently from other R and R¹ in Z¹-Z⁴ and independently from other R and R¹ in A¹-A⁴ is hydrogen, halogen, an alkyl, alkenyl, alkynyl, or aryl group; n in Z¹-Z⁴ is independent of n in A¹-A⁴ and is a whole number that can have value from 0 to 21.

Z^(A) has a similar structure to that of Z^(B), but Z^(A) has x extra -CH₂- fragment(s) in one or more A¹-A⁴ fragments; and/or Z^(A) has x extra -CF₂- fragment(s) in one or more A¹-A⁴ fragments; and/or if Z^(A) contains x number of protons, Z^(B) may contain y number of halogens in the place of protons, and, correspondingly, x−y number of protons remaining in one or more A¹-A⁴ fragments; and/or Z^(A) has x extra -O- fragment(s) in one or more A¹-A⁴ fragments; and/or Z^(A) has x extra -S- fragment(s) in one or more A¹-A⁴ fragments; and/or if Z^(A) contains x number of -O- fragment(s), Z^(B) may contain y number of -S- fragment(s) in the place of -O- fragment(s), and, correspondingly, x−y number of -O- fragment(s) remaining in one or more A¹-A⁴ fragments; and so on.

x and y are whole numbers between 1 and 21 such that x is greater than y.

An example of a homologous label pair/series: (CH₂)_(n)/(CH₂)_(n+m), where n=0, 1, 2, . . . , 21; m=1, 2, . . . , 21 (delta mass=14m).
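The nominal and exact mass differentials for the methylene series can be checked with a short sketch. The monoisotopic masses of carbon-12 and hydrogen-1 are standard values; the function names are hypothetical.

```python
# Monoisotopic mass of one -CH2- unit in Da: carbon-12 plus two hydrogen-1 atoms.
CH2_MASS = 12.000000 + 2 * 1.007825

def nominal_shift(m: int) -> int:
    """Nominal delta mass between (CH2)n and (CH2)n+m, as stated in the text."""
    return 14 * m

def exact_shift(m: int) -> float:
    """Monoisotopic delta mass for m extra methylene units."""
    return CH2_MASS * m

for m in (1, 2, 3):
    print(m, nominal_shift(m), round(exact_shift(m), 4))
```

Unlike isotopic labels, homologous labels change the structure of the tag, so the assumption that retention and ionization are unaffected is likely to hold only for small m.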

2. Bridging and Activating Reagents:

We may either utilize commercially available cross-linkers or synthesize our own:

-   a. Reactive site 1: probe specific
-   b. Reactive site 2: amino acid specific

3. Methods for Peptide/Protein Separation and Detection:

On-line two-dimensional capillary LC ESI MS/MS (MudPIT), as described in the global differential profiling disclosure, or 1D LC ESI MS/MS, or MALDI MS.

4. Sequence Analysis and Quantification:

Peptides are quantified by measuring, in the MS mode, the relative signal intensities for pairs or series of peptide ions of identical sequence that are tagged differentially and therefore differ in mass by the mass differential encoded within the differential labeling reagents. Peptide sequence information is automatically generated by selecting peptide ions of a particular mass-to-charge (m/z) ratio for collision-induced dissociation (CID) in the mass spectrometer operating in the tandem MS mode (Link et al., Electrophoresis 18: 1314-34 (1997); Gygi et al., Nature Biotechnol. 17: 994-9 (1999); Gygi et al., Cell Biol. 19: 1720-30 (1999)).
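The pairing step can be sketched as follows, under stated assumptions: peaks have already been assigned a charge state, light and heavy partners share that charge, and they are separated in m/z by the label mass differential divided by the charge. The `Peak` type, the function name, the tolerance, and the example values are all hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Peak:
    mz: float         # mass-to-charge ratio
    intensity: float  # signal intensity in the MS mode
    charge: int       # assigned charge state

def pair_ratios(peaks: list[Peak], delta_mass: float, tol: float = 0.01):
    """Return (light m/z, heavy m/z, heavy/light intensity ratio) for each matched pair."""
    results = []
    for light in peaks:
        if light.intensity <= 0:
            continue  # cannot form a ratio against a zero signal
        expected = light.mz + delta_mass / light.charge
        for heavy in peaks:
            if heavy.charge == light.charge and abs(heavy.mz - expected) <= tol:
                results.append((light.mz, heavy.mz, heavy.intensity / light.intensity))
    return results

# Illustrative peaks: one doubly charged pair separated by 8.05 Da, plus an unpaired ion.
peaks = [Peak(500.25, 1000.0, 2), Peak(504.275, 2500.0, 2), Peak(620.0, 300.0, 1)]
pairs = pair_ratios(peaks, delta_mass=8.05)
print(pairs)
```

In practice the ratio would more likely be taken over integrated chromatographic peak areas than over single-scan intensities; the single-scan form is used here only to keep the sketch short.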

The resulting tandem mass spectra can be correlated to sequence databases to identify the protein from which the sequenced peptide originated. Currently available commercial software packages include Turbo SEQUEST by ThermoFinnigan, Mascot by Matrix Science, and Sonar MS/MS by ProteoMetrics. Special software development will be necessary for automated relative quantification.

One suggested approach of practicing the invention:

-   1. Protein sample preparation, which may include protein denaturation, reduction, and proteolytic digestion
-   2. Treatment of the probe with a desired activating or bridging reagent
-   3. Treatment of the activated probe with a mixture of peptides
-   4. Washing off unbound peptides, which do not have the targeted amino acid
-   5. Combining the modified, differentially labeled peptide mixtures
-   6. Releasing peptides by cleaving the probe (steps 5 and 6 can be switched)
-   7. Removing solvent or desalting if necessary
-   8. Redissolving the peptides in LC loading buffer
-   9. LC ESI MS and MS/MS analysis, or MALDI MS and MS/MS analysis
-   10. Database searching and data analysis

1.5. Metabolomics and Lipidomics

Additional holistic monitoring approaches, metabolomics and lipidomics, include profiling metabolite pools, carbohydrates, lipids, glycoproteins, and glycolipids. Various chromatographic methods and other qualitative and/or quantitative methods could be utilized to characterize lipid profiles. In the area of metabolomics, methods that compare concentrations of metabolites/small molecules using a variety of chemical analysis tools, e.g., mass spectrometry, NMR, other spectroscopic techniques, and biosensors, could be utilized. For some specific method examples, see the following references: J. C. Lindon et al., Prog. NMR Spectr., 29, 1 (1996); J. C. Lindon et al., Drug Met. Rev., 29, 705 (1997); B. Vogler et al., J. Nat. Prod., 61, 175 (1998); JA. Wolfender et al., Curr. Org. Chem., 2, 575 (1998); and J. K. Nicholson et al., Xenobiotica, 29, 1181 (1999).

1.6. Screening Tools

1.6.1. FACS

Fluorescence activated cell sorting (FACS) methods are also a powerful tool for selection/screening. In some instances a fluorescent molecule is made within a cell (e.g., green fluorescent protein). The cells producing the protein can simply be sorted by FACS. Gel microdrop technology allows screening of cells encapsulated in agarose microdrops (Weaver et al. Methods 2: 234-247 (1991)). In this technique, products secreted by the cell (such as antibodies or antigens) are immobilized with the cell that generated them. Sorting and collection of the drops containing the desired product thus also collects the cells that made the product, and provides a ready source for the cloning of the genes encoding the desired functions. Desired products can be detected by incubating the encapsulated cells with fluorescent antibodies (Powell et al. Bio/Technology 8: 333-337 (1990)). FACS can also be used with this technique to assay resistance to toxic compounds and antibiotics by selecting droplets that contain multiple cells (i.e., the product of continued division in the presence of a cytotoxic compound; Goguen et al. Nature 363: 189-190 (1995)). This method can select for any enzyme that can change the fluorescence of a substrate that can be immobilized in the agarose droplet.

1.6.2. Reporter Molecule

In some embodiments of the invention, screening can be accomplished by assaying reactivity with a reporter molecule reactive with a desired feature of, for example, a gene product. Thus, specific functionalities such as antigenic domains can be screened with antibodies specific for those determinants.

1.6.3. Cell-Cell Indicator

In other embodiments of the invention, screening is preferably done with a cell-cell indicator assay. In this assay format, separate library cells (Cell A, the cell being assayed) and reporter cells (Cell B, the assay cell) are used.

Only one component of the system, the library cells, is allowed to evolve. The screening is generally carried out in a two-dimensional immobilized format, such as on plates. The products of the metabolic pathways encoded by these genes (in this case, usually secondary metabolites such as antibiotics, polyketides, carotenoids, etc.) diffuse out of the library cell to the reporter cell. The product of the library cell may affect the reporter cell in one of a number of ways.

The assay system (indicator cell) can have a simple readout (e.g., green fluorescent protein, luciferase, beta-galactosidase) which is induced by the library cell product but which does not affect the library cell. In these examples the desired product can be detected by colorimetric changes in the reporter cells adjacent to the library cell.

1.6.4. Feedback Mechanism

In other embodiments, indicator cells can in turn produce something that modifies the growth rate of the library cells via a feedback mechanism. Growth rate feedback can detect and accumulate very small differences. For example, if the library and reporter cells are competing for nutrients, library cells producing compounds to inhibit the growth of the reporter cells will have more available nutrients, and thus will have more opportunity for growth. This is a useful screen for antibiotics or a library of polyketide synthesis gene clusters where each of the library cells is expressing and exporting a different polyketide gene product.
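The way growth rate feedback accumulates very small differences can be illustrated with a minimal model. It assumes unrestricted exponential growth and a variant whose growth rate exceeds that of the rest of the population by a fixed relative advantage; real cultures are nutrient limited, so this is a sketch rather than a prediction. The function and parameter names are hypothetical.

```python
def enrichment(initial_fraction: float, advantage: float, generations: float) -> float:
    """Fraction of the population that is the variant after the given number of
    doublings of the reference population (assumes 0 < initial_fraction < 1)."""
    # The variant doubles (1 + advantage) times per reference doubling, so its
    # odds relative to the reference grow by a factor of 2 ** (generations * advantage).
    odds = initial_fraction / (1.0 - initial_fraction) * 2.0 ** (generations * advantage)
    return odds / (1.0 + odds)

# A 5% growth advantage, starting from one variant cell per thousand.
for g in (0, 50, 100, 200):
    print(g, round(enrichment(0.001, 0.05, g), 4))
```

Under these assumptions a variant that is initially rare comes to make up about half of the population after roughly 200 generations, even though the difference per generation is small.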

1.6.5. Screening Secreted Molecules

Another variation of this theme is that the reporter cell for an antibiotic selection can itself secrete a toxin or antibiotic that inhibits growth of the library cell. Production by the library cell of an antibiotic that is able to suppress growth of the reporter cell will thus allow uninhibited growth of the library cell.

Conversely, the library may be screened for production of a compound that stimulates the growth of the reporter cell (for example, in improving chemical syntheses, the library cell may supply nutrients such as amino acids to an auxotrophic reporter, or growth factors to a growth-factor-dependent reporter). The reporter cell in turn should produce a compound that stimulates the growth of the library cell. Interleukins, growth factors, and nutrients are possibilities. Further possibilities include competition based on ability to kill surrounding cells, and positive feedback loops in which the desired product made by the evolved cell stimulates the indicator cell to produce a positive growth factor for cell A, thus indirectly selecting for increased product formation.

In some embodiments of the invention it can be advantageous to use a different organism (or genetic background) for screening than the one that will be used in the final product. For example, markers can be added to DNA constructs used for recursive sequence recombination to make the microorganism dependent on the constructs during the improvement process, even though those markers may be undesirable in the final recombinant microorganism.

Likewise, in some embodiments it is advantageous to use a different substrate for screening an evolved enzyme than the one that will be used in the final product. For example, Evnin et al. (Proc. Natl. Acad. Sci. U.S.A. 87: 6659-6663 (1990)) selected trypsin variants with altered substrate specificity by requiring that variant trypsin generate an essential amino acid for an arginine auxotroph by cleaving arginine beta-naphthylamide. This is thus a selection for arginine-specific trypsin, with the growth rate of the host being proportional to the enzyme activity.

The pool of cells surviving screening and/or selection is enriched for recombinant genes conferring the desired phenotype (e.g. altered substrate specificity, altered biosynthetic ability, etc.). Further enrichment can be obtained, if desired, by performing a second round of screening and/or selection without generating additional diversity.

The recombinant gene or pool of such genes surviving one round of screening/selection forms one or more of the substrates for a second round of recombination. Again, recombination can be performed in vivo or in vitro by any of the recursive sequence recombination formats described above.

If recursive sequence recombination is performed in vitro, the recombinant gene or genes to form the substrate for recombination should be extracted from the cells in which screening/selection was performed. Optionally, a subsequence of such gene or genes can be excised for more targeted subsequent recombination. If the recombinant gene(s) are contained within episomes, their isolation presents no difficulties. If the recombinant genes are chromosomally integrated, they can be isolated by amplification primed from known sequences flanking the regions in which recombination has occurred. Alternatively, whole genomic DNA can be isolated, optionally amplified, and used as the substrate for recombination. Small samples of genomic DNA can be amplified by whole genome amplification with degenerate primers (Barrett et al. Nucleic Acids Research 23: 3488-3492 (1995)). These primers result in a large amount of random 3′ ends, which can undergo homologous recombination when reintroduced into cells.

If the second round of recombination is to be performed in vivo, as is often the case, it can be performed in the cell surviving screening/selection, or the recombinant genes can be transferred to another cell type (e.g., a cell type having a high frequency of mutation and/or recombination). In this situation, recombination can be effected by introducing additional DNA segment(s) into cells bearing the recombinant genes. In other methods, the cells can be induced to exchange genetic information with each other by, for example, electroporation. In some methods, the second round of recombination is performed by dividing a pool of cells surviving screening/selection in the first round into two subpopulations. DNA from one subpopulation is isolated and transfected into the other subpopulation, where the recombinant gene(s) from the two subpopulations recombine to form a further library of recombinant genes. In these methods, it is not necessary to isolate particular genes from the first subpopulation or to take steps to avoid random shearing of DNA during extraction. Rather, the whole genome of DNA, sheared or otherwise cleaved into manageable sized fragments, is transfected into the second subpopulation. This approach is particularly useful when several genes are being evolved simultaneously and/or the location and identity of such genes within the chromosome are not known.

The second round of recombination is sometimes performed exclusively among the recombinant molecules surviving selection. However, in other embodiments, additional substrates can be introduced. The additional substrates can be of the same form as the substrates used in the first round of recombination, i.e., additional natural or induced mutants of the gene or cluster of genes forming the substrates for the first round. Alternatively, the additional substrate(s) in the second round of recombination can be exactly the same as the substrate(s) in the first round of recombination.

After the second round of recombination, recombinant genes conferring the desired phenotype are again selected. The selection process proceeds essentially as before. If a suicide vector bearing a selective marker was used in the first round of selection, the same vector can be used again. Again, a cell or pool of cells surviving selection is selected. If a pool of cells is selected, the cells can be subject to further enrichment.

1.4. Screening for Various Potential Applications

1.4.1 Novel Drugs: Identifying Targets

The invention relates to procedures that can be applied to identifying compounds that bind to and modulate the function of target components of a cell whose function is known or unknown, and cell components that are not amenable to other screening methods. The invention relates to generating and/or identifying a compound that binds to and modulates (inhibits or enhances) the function of a component of a cell, thereby producing a phenotypic effect in the cell. Such a screen may involve identifying a biomolecule that 1) binds to, in vitro, a component of a cell that has been isolated from other constituents of the cell and that 2) causes, in vivo, as seen in an assay upon intracellular expression of the biomolecule, a phenotypic effect in the cell which is the usual producer and host of the target cell component. In an assay demonstrating characteristic 2) above, intracellular production of the biomolecule can be in cells grown in culture or in cells introduced into an animal. Further methods within these procedures are those methods comprising an assay for a phenotypic effect in the cell upon intracellular production of the biomolecule, either in cells in culture or in cells that have been introduced into one or more animals, and an assay to identify one or more compounds that behave as competitors of the biomolecule in an assay of binding to the target cell component. The target cell component in this embodiment and in other embodiments not limited to pathogens can be one that is found in mammalian cells, especially cells of a type found to cause or contribute to disease or the symptoms of disease (e.g., cells of tumors or cells of other types of hyperproliferative disorders).

1.7.1. Process for Identifying One or More Compounds that Produce a Phenotypic Effect on a Cell

One procedure envisioned in the invention is a process for identifying one or more compounds that produce a phenotypic effect on a cell. The process is at the same time a method for target validation. The process is characterized by identifying a biomolecule which binds an isolated target cell component, constructing cells comprising the target cell component and further comprising a gene encoding the biomolecular binder which can be expressed to produce the biomolecular binder, testing the constructed cells for their ability to produce, upon expression of the gene encoding the biomolecular binder, a phenotypic effect in the cells (e.g., inhibition of growth), wherein the test of the constructed cells can be a test of the cells in culture or a test of the cells after introducing them into host animals, or both, and further, identifying, for a biomolecular binder that caused the phenotypic effect, one or more compounds that compete with the biomolecular binder for binding to the target cell component.

A test of the constructed cells after introducing them into host animals is especially well-suited to assessing whether a biomolecular binder can produce a particular phenotype by the expression (regulatable by the researcher) of a gene encoding the biomolecular binder. In this method, cells are constructed which have a gene encoding the biomolecular binder, and wherein the biomolecular binder can be produced by regulation of expression of the gene. The constructed cells are introduced into a set of animals. Expression of the gene encoding the biomolecular binder is regulated in one group of the animals (test animals) such that the biomolecular binder is produced. In another group of animals, the gene encoding the biomolecular binder is regulated such that the biomolecular binder is not produced (control animals). The cells in the two groups of animals are monitored for a phenotypic change (for example, a change in growth rate). If the phenotypic change is observed in cells in the test animals and not in the cells in the control animals, or to a lesser extent in the control animals, then the biomolecular binder has been proven to be effective in binding to its target cell component under in vivo conditions.

A further embodiment of the invention is a method for determining whether a target cell component of a particular cell type (a “first cell”) is essential to producing a phenotypic effect on the first cell, the method having the steps:

isolating the target component of the first cell; identifying a biomolecular binder of the isolated target component of the first cell; constructing a second type of cells (“second cell”) comprising the target component and a regulable, exogenous gene encoding the biomolecular binder; and testing the second cell in culture for an altered phenotypic effect, upon production of the biomolecular binder in the second cell; whereby, if the second cell shows the altered phenotypic effect upon production of the biomolecular binder, then the target component of the first cell is essential to producing the phenotypic effect on the first cell. The target cell component in this embodiment and in other embodiments not limited to pathogens can be one that is found in mammalian cells, especially cells of a type found to cause or contribute to disease or the symptoms of disease (e.g., cells of tumors or cells of other types of hyperproliferative disorders).

1.7.3. Identifying a Biomolecular Inhibitor of Growth of Pathogen Cells

One embodiment of the invention is a method for identifying a biomolecular inhibitor of growth of pathogen cells by using cell culture techniques, comprising contacting one or more types of biomolecules with isolated target cell component of the pathogen, applying a means of detecting bound complexes of biomolecules and target cell component, whereby, if the bound complexes are detected, one or more types of biomolecules have been identified as a biomolecular binder of the target cell component, constructing a pathogen strain having a regulatable gene encoding the biomolecular binder, regulating expression of the gene encoding the biomolecular binder to express the gene; and monitoring growth of the pathogen cells in culture relative to suitable control cells, whereby, if growth of the pathogen cells is decreased compared to growth of suitable control cells, then the biomolecule is a biomolecular inhibitor of growth of the pathogen cells.

1.7.4. Identifying Compounds that Inhibit Infection of a Mammal by a Pathogen

A further embodiment of the invention is a method, employing an animal test, for identifying one or more compounds that inhibit infection of a mammal by a pathogen by binding to a target cell component, comprising constructing a pathogen comprising a regulable gene encoding a biomolecule which binds to the target cell component, infecting test animals with the pathogen, regulating expression of the regulable gene to produce the biomolecule, monitoring the test animals and suitable control animals for signs of infection, wherein observing fewer or less severe signs of infection in the test animals than in suitable control animals indicates that the biomolecule is a biomolecular inhibitor of infection, and identifying one or more compounds that compete with the biomolecular inhibitor of growth for binding to the target cell component (as by employing a competitive binding assay), whereby, if a compound so competes, then the compound inhibits infection of a mammal by a pathogen by binding to a target.

The competitive binding assay to identify binding analogs of biomolecular binders, which have been proven to bind to their targets in an intracellular test of binding, can be applied to any target for which a biomolecular binder has been identified, including targets whose function is unknown or targets for which other types of assays are not easily developed and performed. Therefore, the method of the invention offers the advantage of decreasing assay development time when using a gene product of known function as a target cell component and the advantage of bypassing the major hurdle of gene function identification when using a gene product of unknown function as a target cell component.

Other embodiments of the invention are cells comprising a biomolecule and a target cell component, wherein the biomolecule is produced by expression of a regulable gene, and wherein the biomolecule modulates function of the target cell component, thereby causing a phenotypic change in the cells. Yet other embodiments are cells comprising a biomolecule and a target cell component, wherein the biomolecule is a biomolecular binder of the target cell component, and is encoded by a regulatable gene. The cells can include mammalian cells or cells of a pathogen, for instance, and the phenotypic change can be a change in growth rate.

The pathogen can be a species of bacteria, yeast, fungus, or parasite, for example.

1.7.5. Intracellular Validation of a Biomolecule

Described herein are methods that result in the identification ofcompounds that cause a phenotypic effect on a cell. The general stepsdescribed herein to find a compound for drug development can be thoughtof as these: (1) identifying a biomolecule that can bind to an isolatedtarget cell component in vitro, (2) confirming that the biomolecule,when produced in cells with the target cell component, can cause adesired phenotypic effect and (3) identifying, by an in vitro screeningmethod, for example, compounds that compete with the biomolecule forbinding to the target cell component. Central to these methods isgeneral step (2) above, intracellular validation of a biomoleculecomprising one or more steps that determine whether a biomolecule cancause a phenotypic effect on a cell, when the biomolecule is produced bythe expression (which can be regulatable) of a gene in the cell. As usedin general step (2), a biomolecule is a gene product (e.g., polypeptide,RNA, peptide or RNA oligonucleotide) of an exogenous gene—a gene whichhas been introduced in the course of construction of the cell.

Biomolecules that bind to and alter the function of a candidate targetare identified by various in vitro methods. Upon production of thebiomolecule within a cell either in vitro or within an animal modelsystem, the biomolecule binds to a specific site on the target, altersits intracellular function, and hence produces a phenotypic change (e.g.cessation of growth, cell death). When the biomolecule is produced inengineered pathogen cells in an animal model of infection, cessation ofgrowth or death of the engineered pathogen cells leads to the clearingof infection and animal survival, demonstrating the importance of thetarget in infection and thereby validating the target.

A further embodiment of this invention provides for (1) identifying a biomolecule that produces a phenotypic effect on a cell (wherein the cell can be, for instance, a pathogen cell or a mammalian cell) and (2) simultaneous intracellular target validation (see reference: patents??).

1.7.6. Methods for Identifying Compounds that Inhibit the Growth of Cells Having a Target Cell Component

The invention includes methods for identifying compounds that inhibit the growth of cells having a target cell component. The target cell component can first be identified as essential to the growth of the cells in culture and/or under conditions in which it is desired that the growth of the cells be inhibited. These methods can be applied, for example, to various types of cells that undergo abnormal or undesirable proliferation, including cells of neoplasms (tumors or growths, either benign or malignant) which, as known in the art, can originate from a variety of different cell types. Such cells can be referred to, for example, as being from adenomas, carcinomas, lymphomas or leukemias. The method can also be applied to cells that proliferate abnormally in certain other diseases, such as arthritis, psoriasis or autoimmune diseases.

If intracellular expression of the biomolecular binder inhibits the function of a target essential for growth (presumably by binding to the target at a biologically relevant site), cells monitored in step (2) will exhibit a slow growth or no growth phenotype. Targets found to be essential for growth by these methods are validated starting points for drug discovery, and can be incorporated into assays to identify more stable compounds that bind to the same site on the target as the biomolecule. Where the cells are pathogen cells and the desired phenotypic change to be monitored is inhibition of growth, the invention provides a procedure to examine the activity of target (pathogen) cell components in an animal infection model.

1.7.7. Study as a Target Cell Component a Gene Product of a Particular Cell Type

In the course of this method, it may be decided to study as a target cell component a gene product of a particular cell type (e.g., a type of pathogenic bacteria), wherein the target cell component is already known as being encoded by a characterized gene, as a potential target for a modulator to be identified. In this case, the target cell component can be isolated directly from the cell type of interest, assuming suitable culture methods are available to grow a sufficient number of cells, using methods appropriate to the type of cell component to be isolated (e.g., protein purification methods such as differential precipitation, ion exchange chromatography, gel chromatography, affinity chromatography, HPLC).

1.7.8. Target Cell Component can be Produced Recombinantly

Alternatively, the target cell component can be produced recombinantly, which requires that the gene encoding the target cell component be isolated from the cell type of interest. This can be done by any number of methods, for example known methods such as PCR, using template DNA isolated from the pathogen or a DNA library produced from the pathogen DNA, and using primers based on known sequences or combinations of known and unknown sequences within or external to the chosen gene. See, for example, methods described in “The Polymerase Chain Reaction,” Chapter 15 of Current Protocols in Molecular Biology, (Ausubel, F. M. et al., eds), John Wiley & Sons, New York, 1998. Other methods include cloning a gene from a DNA library (e.g., a cDNA library from a eukaryotic pathogen) into a vector (e.g., plasmid, phage, phagemid, virus, etc.) and applying a means of selection or screening to clones resulting from a transformation of vectors (including a population of vectors now having inserted genes) into appropriate host cells. The screening method can take advantage of properties given to the host cells by the expression of the inserted chosen gene (e.g., detection of the gene product by antibodies directed against it, detection of an enzymatic activity of the gene product), or can detect the presence of the gene itself (for instance, by methods employing nucleic acid hybridization). For methods of cloning genes in E. coli, which also may be applicable to cloning in other bacterial species, see, for example, “Escherichia coli, Plasmids and Bacteriophages,” Chapter 1 of Current Protocols in Molecular Biology, (Ausubel, F. M. et al., eds), John Wiley & Sons, New York, 1998. For methods applicable to cloning genes of eukaryotic origin, see Chapter 5 (“Construction of Recombinant DNA Libraries”), Chapter 9 (“Introduction of DNA Into Mammalian Cells”) and Chapter 6 (“Screening of Recombinant DNA Libraries”) of Current Protocols in Molecular Biology, (Ausubel, F. M. et al., eds), John Wiley & Sons, New York, 1998.

Target proteins can be expressed with E. coli or other prokaryotic gene expression systems, or in eukaryotic gene expression systems. Since many eukaryotic proteins carry unique modifications that are required for their activities, e.g., glycosylation and methylation, protein expression can in some cases be better carried out in eukaryotic systems, such as yeast, insect, or mammalian cells that can perform these modifications. Examples of these expression systems have been reviewed in the following literature: Methods in Enzymology, Volume 185, eds D. V. Goeddel, Academic Press, San Diego, 1990; Geisse et al., Protein Expression and Purification 8: 271-282, 1996; Simonsen and McGrogan, Biologicals 22: 85-94; Jones and Morikawa, Current Opinion in Biotechnology 7: 512-516, 1996; Possee, Current Opinion in Biotechnology 8: 569-572.

Where a gene encoding a chosen target cell component has not been isolated previously, but is thought to exist because homologs of the gene product are known in other species, the gene can be identified and cloned by a method such as that used in Shiba et al., U.S. Pat. No. 5,759,833, Shiba et al., U.S. Pat. No. 5,629,188, Martinis et al., U.S. Pat. No. 5,656,470 and Sassanfar et al., U.S. Pat. No. 5,756,327. The teachings of these four patents are incorporated herein by reference in their entirety.

1.7.9. Method Should be Used with Target Cell Components which have not been Previously Isolated or Characterized and whose Functions are Unknown

It is an advantage of the target validation method that it can be used with target cell components which have not been previously isolated or characterized and whose functions are unknown. In this case, a segment of DNA containing an open reading frame (ORF; a cDNA can also be used, as appropriate to a eukaryotic cell) which has been isolated from a cell of a type that is to be an object of drug action (e.g., tumor cell, pathogen cell) can be cloned into a vector, and the target gene product of the ORF can be produced in host cells harboring the vector. The gene product can be purified and further studied in a manner similar to that of a gene product that has been previously isolated and characterized.
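As an illustration of locating an open reading frame in a cloned DNA segment, the following is a minimal sketch only. It assumes the standard stop codons TAA, TAG and TGA and an ATG start codon, and the function names `reverse_complement` and `longest_orf` are hypothetical names chosen for this example; they are not part of the described method.

```python
# Minimal sketch (assumptions: ATG start, standard stop codons, both strands scanned).
STOP_CODONS = {"TAA", "TAG", "TGA"}
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the reverse complement; unknown bases are mapped to N."""
    return "".join(COMPLEMENT.get(base, "N") for base in reversed(seq))


def longest_orf(dna):
    """Return the longest ATG-to-stop ORF (stop codon included) in any of six frames.

    Returns an empty string when no complete ORF is found.
    """
    dna = dna.upper()
    best = ""
    for strand in (dna, reverse_complement(dna)):
        for frame in range(3):
            start = None
            for i in range(frame, len(strand) - 2, 3):
                codon = strand[i:i + 3]
                if start is None and codon == "ATG":
                    start = i  # first start codon in this reading frame
                elif start is not None and codon in STOP_CODONS:
                    orf = strand[start:i + 3]
                    if len(orf) > len(best):
                        best = orf
                    start = None  # look for the next ORF in the same frame
    return best


if __name__ == "__main__":
    print(longest_orf("CCATGAAATTTGGGTAACC"))
```

A candidate ORF found this way would still have to be cloned and expressed as described above; the sketch only narrows down which segment to clone.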

In some cases, the open reading frame (in some cases, cDNA) can be isolated from a source of DNA of the cells of interest (genomic DNA or a library, as appropriate), and inserted into a fusion protein or fusion polypeptide construct. This construct can be a vector comprising a nucleic acid sequence which provides a control region (e.g., promoter, ribosome binding site) and a region which encodes a peptide or polypeptide portion of the fusion polypeptide wherein the polypeptide encoded by the fusion vector endows the fusion polypeptide with one or more properties that allow for the purification of the fusion polypeptide. For example, the vector can be one from the pGEX series of plasmids (Pharmacia) designed to produce fusions with glutathione S-transferase.

1.7.10. Host Cells

The isolated DNA having an open reading frame, whether encoding a known or an as yet unidentified gene product, when inserted into an expression construct, can be expressed to produce the target cell component in host cells. Host cells can be, for example, Gram-negative or Gram-positive bacterial cells such as Escherichia coli or Bacillus subtilis, respectively, or yeast cells such as Saccharomyces cerevisiae, Schizosaccharomyces pombe or Pichia pastoris. It is preferable that the target cell component to be used in target validation studies be produced in a host that is genetically related to the pathogen from which the gene encoding it was isolated. For example, for a Gram-negative bacterial pathogen, an E. coli host is preferred over a Pichia pastoris host. The target cell component so produced can then be isolated from the host cells. Many protein purification methods are known that separate proteins on the basis of, for instance, size, charge, or affinity for a binding partner (e.g., for an enzyme, a binding partner can be a substrate or substrate analog), and these methods can be combined in a sequence of steps by persons of skill in the art to produce an effective purification scheme. For methods to manipulate RNA, see, for example, Chapter 4 in Current Protocols in Molecular Biology (Ausubel, F. M. et al., eds), John Wiley & Sons, New York, 1998.

An isolated cell component or a fusion protein comprising the cell component can be used in a test to identify one or more biomolecular binders of the isolated product (general step (1)). A biomolecular binder of a target cell component can be identified by in vitro assays that test for the formation of complexes of target and biomolecular binder noncovalently bound to each other. For example, the isolated target can be contacted with one or more types of biomolecules under conditions conducive to binding, the unbound biomolecules can be removed from the targets, and a means of detecting bound complexes of biomolecules and targets can be applied. The detection of the bound complexes can be facilitated by having either the potential biomolecular binders or the target labeled or tagged with an adduct that allows detection or separation (e.g., radioactive isotope or fluorescent label; streptavidin, avidin or biotin affinity label).

Alternatively, both the potential biomolecular binders and the target can be differentially labeled. For examples of such methods see, e.g., WO 98/19162.

1.7.11. Biomolecules to be Tested and Means for Detection

The biomolecules to be tested for binding to a target can be from a library of candidate biomolecular binders (e.g., a peptide or oligonucleotide library). For example, a peptide library can be displayed on the coat protein of a phage (see, for examples of the use of genetic packages such as phage display libraries, Koivunen, E. et al., J. Biol. Chem. 268: 20205-20210 (1993)). The biomolecules can be detected by means of a chemical tag or label attached to or integrated into the biomolecules before they are screened for binding properties. For example, the label can be a radioisotope, a biotin tag, or a fluorescent label. Those molecules that are found to bind to the target molecule can be called biomolecular binders.

1.7.12. Fusion Proteins

An isolated target cell component, an antigenically similar portion thereof, or a suitable fusion protein comprising all of or a portion of the target can be used in a method to select and identify biomolecules which bind specifically to the target. Where the target cell component comprises a protein, fusion proteins comprising all of, or a portion of, the target linked to a second moiety not occurring in the target as found in nature, can be prepared for use in another embodiment of the method. Suitable fusion proteins for this purpose include those in which the second moiety comprises an affinity ligand (e.g., an enzyme, antigen, epitope). The fusion proteins can be produced by the insertion of a gene encoding a target or a suitable portion of such gene into a suitable expression vector, which encodes an affinity ligand (e.g., pGEX4T-2 and pET-15b, encoding glutathione S-transferase and His-Tag affinity ligands, respectively). The expression vector can be introduced into a suitable host cell for expression. Host cells are lysed and the lysate, containing fusion protein, can be bound to a suitable affinity matrix by contacting the lysate with an affinity matrix under conditions sufficient for binding of the affinity ligand portion of the fusion protein to the affinity matrix.

1.7.12.1. Fusion Protein can be Immobilized

In one embodiment, the fusion protein can be immobilized on a suitable affinity matrix under conditions sufficient to bind the affinity ligand portion of the fusion protein to the matrix, and is contacted with one or more candidate biomolecules (e.g., a mixture of peptides) to be tested as biomolecular binders, under conditions suitable for binding of the biomolecules to the target portion of the bound fusion protein. Next, the affinity matrix with bound fusion protein can be washed with a suitable wash buffer to remove unbound biomolecules and non-specifically bound biomolecules. Biomolecules which remain bound can be released by contacting the affinity matrix with fusion protein bound thereto with a suitable elution buffer. Wash buffer can be formulated to permit binding of the fusion protein to the affinity matrix, without significantly disrupting binding of specifically bound biomolecules. In this aspect, elution buffer can be formulated to permit retention of the fusion protein by the affinity matrix, but can be formulated to interfere with binding of the test biomolecule(s) to the target portion of the fusion protein. For example, a change in the ionic strength or pH of the elution buffer can lead to release of biomolecules, or the elution buffer can comprise a release component or components designed to disrupt binding of biomolecules to the target portion of the fusion protein.

Immobilization can be performed prior to, simultaneous with, or after contacting the fusion protein with biomolecule, as appropriate. Various permutations of the method are possible, depending upon factors such as the biomolecules tested, the affinity matrix-ligand pair selected, and elution buffer formulation. For example, after the wash step, fusion protein with biomolecules bound thereto can be eluted from the affinity matrix with a suitable elution buffer (a matrix elution buffer, such as glutathione for a GST fusion). Where the fusion protein comprises a cleavable linker, such as a thrombin cleavage site, cleavage from the affinity ligand can release a portion of the fusion with the biomolecules bound thereto. Bound biomolecule can then be released from the fusion protein or its cleavage product by an appropriate method, such as extraction.

1.7.12. Various Methods to Identify Biomolecular Binders

One or more candidate biomolecular binders can be tested simultaneously. Where a mixture of biomolecules is tested, the biomolecules selected by the foregoing processes can be separated (as appropriate) and identified by suitable methods (e.g., PCR, sequencing, chromatography). Large libraries of biomolecules (e.g., peptides, RNA oligonucleotides) produced by combinatorial chemical synthesis or other methods can be tested (see, e.g., Ohlmeyer, M. H. J. et al., Proc. Natl. Acad. Sci. USA 90: 10922-10926 (1993) and DeWitt, S. H. et al., Proc. Natl. Acad. Sci. USA 90: 6909-6913 (1993), relating to tagged compounds; see also Rutter, W. J. et al., U.S. Pat. No. 5,010,175; Huebner, V. D. et al., U.S. Pat. No. 5,182,366; and Geysen, H. M., U.S. Pat. No. 4,833,092). Random sequence RNA libraries (see Ellington, A. D. et al., Nature 346: 818-822 (1990); Bock, L. C. et al., Nature 355: 564-566 (1992); and Szostak, J. W., Trends in Biochem. Sci. 17: 89-93 (March, 1992)) can also be screened according to the present method to select RNA molecules which bind to a target. Where biomolecules selected from a combinatorial library by the present method carry unique tags, identification of individual biomolecules by chromatographic methods is possible. Where biomolecules do not carry tags, chromatographic separation, followed by mass spectrometry to ascertain structure, can be used to identify individual biomolecules selected by the method, for example.

Other methods to identify biomolecular binders of a target cell component can be used. For example, the two-hybrid system or interaction trap is an in vivo system that can be used to identify polypeptides, peptides or proteins (candidate biomolecular binders) that bind to a target protein. In this system, both candidate biomolecular binders and target cell component proteins are produced as fusion proteins. The two-hybrid system and variations on it have been described (U.S. Pat. No. 5,283,173 and U.S. Pat. No. 5,468,614; Golemis, E. A. et al., pages 20.1.1-20.1.35 In Current Protocols in Molecular Biology, F. M. Ausubel et al., eds., John Wiley and Sons, containing supplements up through Supplement 40, 1997; two-hybrid systems available from Clontech, Palo Alto, Calif.).

Once one or more biomolecular binders of a cell component have been identified, further steps can be combined with those taken to identify the biomolecular binder, to identify those biomolecular binders that produce a phenotypic effect on a cell (where “a cell” can mean cells of a cell strain or cell line).

Thus, a method for identifying a biomolecule that produces a phenotypic effect on a first cell can comprise the steps of identifying a biomolecular binder of an isolated target cell component of the first cell, constructing a second cell comprising the target cell component and a regulable exogenous gene encoding the biomolecular binder, and testing the second cell for the phenotypic effect, upon production of the biomolecular binder in the second cell, where the second cell can be maintained in culture or introduced into an experimental animal. If the second cell shows the phenotypic effect upon intracellular production of the biomolecular binder, then a biomolecule that produces a phenotypic effect on the first cell has been identified. Testing the second cell is general step (2) of the invention, as the three general steps were outlined above.

1.7.13. Host Cells: Engineered to Control Expression

Host cells (also, “second cells” in the terminology used above) of the cell type (e.g., species of pathogenic bacteria) the target was isolated from (or the gene encoding the target was originally isolated from, if the target is produced by recombinant methods), can be engineered to harbor a gene that can regulably express the biomolecular binder (e.g., under an inducible or repressible promoter). The ability to regulate the expression of the biomolecular binder is desirable because constitutive expression of the biomolecular binder could be lethal to the cell.

Therefore, inducible or regulated expression gives the researcher the ability to control if and when the biomolecular binder is expressed. The gene expressing the biomolecular binder can be present in one or more copies, either on an extrachromosomal structure, such as on a single or multicopy plasmid, or integrated into the host cell genome. Plasmids that provide an inducible gene expression system in pathogenic organisms can be used. For example, plasmids allowing tetracycline-inducible expression of a gene in Staphylococcus aureus have been developed.

1.7.14. Genes for Expression

For intracellular expression of a biomolecule to be tested for its phenotypic effect in a eukaryotic cell (e.g., mammalian cell), the genes for expression can be carried on plasmid-based or virus-based vectors, or on a linear piece of DNA or RNA. For examples of expression vectors, see Hosfield and Lu, Biotechniques: 306-309, 1998; Stephens and Cockett, Nucleic Acids Research 17: 7110, 1989; Wohlgemuth et al., Gene Therapy 3: 503-512, 1996; Ramirez-Solis et al., Gene 87: 291-294, 1990; Dirks et al., Gene 149: 387-388, 1994; Chengalvala et al., Current Opinion in Biotechnology 2: 718-722, 1991; Methods in Enzymology, Volume 185, (D. V. Goeddel, ed.) Academic Press, San Diego, 1990. The genetic material can be introduced into cells using a variety of techniques, including whole cell or protoplast transformation, electroporation, calcium phosphate-DNA precipitation or DEAE-Dextran transfection, liposome mediated DNA or RNA transfer, or transduction with recombinant viral or retroviral vectors. Expression of the gene can be constitutive (e.g., ADHI promoter for expression in S. cerevisiae (Bennetzen, J. L. and Hall, B. D., J. Biol. Chem. 257: 3026-3031 (1982)), or CMV immediate early promoter and RSV LTR for mammalian expression) or inducible, as the inducible GAL1 promoter in yeast (Davis, L. I. and Fink, G. R., Cell 61: 965-978 (1990)). A variety of inducible systems can be utilized; for example, the E. coli Lac repressor/operator system and the Tn10 Tet repressor/operator system have been engineered to govern regulated expression in organisms from bacterial to mammalian cells. Regulated gene expression can also be achieved by activation. For example, gene expression governed by HIV LTR can be activated by HIV or SIV Tat proteins in human cells; GAL4 promoter can be activated by galactose in a nonglucose-containing medium. The location of the biomolecule binder genes can be extrachromosomal or chromosomally integrated. The chromosome integration can be mediated through homologous or nonhomologous recombination.

For proper localization in the cells, it may be desirable to tag the biomolecule binders with certain peptide signal sequences (for example, nuclear localization signal (NLS) sequences, mitochondria localization sequences). Secretion sequences have been well documented in the art.

1.7.15. Fused Biomolecular Binders

For presentation of the biomolecular binders in the intracellular system, they can be fused N-terminally, C-terminally, or internally in a carrier protein (if the biomolecular binder is a peptide), and can be fused (5′, 3′ or internally) in a carrier RNA or DNA molecule (if the biomolecular binder is a nucleic acid). The biomolecular binder can be presented with a protein or nucleic acid structural scaffold. Certain linkages (e.g., a 4-glycine linker for a peptide or a stretch of A's for an RNA) can be inserted between the biomolecular binder and the carrier proteins or nucleic acids.

In such engineered cells, the effect of this biomolecular binder on the phenotype of the cells can be tested, as a manifestation of the binding effect (implying binding to a functionally relevant site, and thus an activating or, more likely, an inhibitory effect) of the biomolecular binder on the target used in an in vitro binding assay as described above. An intracellular test can not only determine which biomolecular binders have a phenotypic effect on the cells, but at the same time can assess whether the target in the cells is essential for maintaining the normal phenotype of the cells. For example, a culture of the engineered cells expressing a biomolecular binder can be divided into two aliquots. The first aliquot (“test” cells) can be treated in a suitable manner to regulate (e.g., induce or release repression of, as appropriate) the gene encoding the biomolecular binder, such that the biomolecular binder is produced in the cells. The second aliquot (“control” cells) can be left untreated so that the biomolecular binder is not produced in the cells. In a variation of this method of testing the effect of a biomolecular binder on the phenotype of the cells, a different strain of cells, not having a gene that can express the biomolecular binder, can be used as control cells. The phenotype of the cells in each culture (“test” and “control” cells grown under the same conditions, other than the expression of the biomolecular binder) can then be monitored by a suitable means (e.g., enzymatic activity, monitoring a product of a biosynthetic pathway, antibody to test for presence of cell surface antigen, etc.). Where the change in phenotype is a change in growth rate, the growth of the cells in each culture (“test” and “control” cells grown under the same conditions, other than the expression of the biomolecular binder) can be monitored by a suitable means (e.g., turbidity of liquid cultures, cell count, etc.). If the extent of growth or rate of growth of the test cells is less than the extent of growth or rate of growth of the control cells, then the biomolecular binder can be concluded to be an inhibitor of the growth of the cells, or a biomolecular inhibitor.
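The comparison of "test" and "control" cultures by turbidity can be sketched as follows. This is an illustrative sketch only: it assumes exponential growth over the sampled interval, turbidity readings (e.g., optical density) taken at the same time points for both cultures, and an arbitrary cutoff `min_ratio` that is a hypothetical parameter, not a value given in this description.

```python
import math


def growth_rate(times_h, od_values):
    """Estimate the specific growth rate (per hour) as the least-squares
    slope of ln(turbidity) versus time, assuming exponential growth."""
    logs = [math.log(od) for od in od_values]
    n = len(times_h)
    mean_t = sum(times_h) / n
    mean_y = sum(logs) / n
    sxx = sum((t - mean_t) ** 2 for t in times_h)
    sxy = sum((t - mean_t) * (y - mean_y) for t, y in zip(times_h, logs))
    return sxy / sxx


def is_growth_inhibited(times_h, test_od, control_od, min_ratio=0.5):
    """Return True when the test culture grows at less than min_ratio times
    the control rate. min_ratio is an assumed, illustrative cutoff."""
    return growth_rate(times_h, test_od) < min_ratio * growth_rate(times_h, control_od)


if __name__ == "__main__":
    hours = [0, 1, 2, 3]
    control = [0.05, 0.10, 0.20, 0.40]      # uninduced cells, doubling hourly
    test = [0.05, 0.06, 0.072, 0.0864]      # induced cells, growing 20% per hour
    print(growth_rate(hours, control), growth_rate(hours, test))
    print(is_growth_inhibited(hours, test, control))
```

In practice replicate cultures and a statistical comparison would be needed before concluding that a biomolecular binder is a biomolecular inhibitor.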

If the phenotype of the test cells is altered relative to that of the control cells, then the biomolecular binder can be concluded to be one that causes a phenotypic effect. In an optional additional test, isolated target cell component having a known function (e.g., an enzyme activity) can be tested for modulation of this known function in the presence of biomolecular binder under conditions conducive to binding of the biomolecular binder to the target cell component. Positive results in these tests should encourage the investigator to continue in the drug discovery process with efforts to find a more stable compound (than a peptide, polypeptide or RNA biomolecule) that mimics the binding properties of the biomolecular binder on the tested target cell component.

1.7.16. Engineering Strain of Cells

A further test can, again, employ an engineered strain of cells that comprise both the target cell component and one or more genes encoding a biomolecule tested to be a biomolecular binder of the target cell component. The cells of the cell strain can be tested in animals to see if regulable expression of the biomolecular binder in the engineered cells produces an observable or testable change in phenotype of the cells. Both the “in culture” test for the effect of intracellular expression of the biomolecular binder and the “in animal” test (described below) for the effect of intracellular expression of the biomolecular binder can be applied not only towards drug discovery in the categories of antimicrobials and anticancer agents, but also towards the discovery of therapeutic agents to treat inflammatory diseases, cardiovascular diseases, diseases associated with metabolic pathways, and diseases associated with the central nervous system, for example.

Where the engineered strain of cells is a strain of pathogen cells or tumor cells, the object of the test is to see whether production of the biomolecular binder in the engineered strain inhibits growth of these cells after their introduction into an animal (or, for a pathogen, after infection of the animal by the engineered pathogen). Such a test can not only determine which biomolecular binders are inhibitors of growth of the cells, but at the same time can assess whether the target in the cells is essential for maintaining growth of the cells (infection, for a pathogenic organism) in a host mammal. Suitable animals for such an experiment are, for example, mammals such as mice, rats, rabbits, guinea pigs, dogs, pigs, and the like. Small mammals are preferred for reasons of convenience.

The engineered cells are introduced into one or more animals (“test” animals) and into one or more animals in a separate group (“control” animals) by a route appropriate to cause symptoms of systemic or local growth of the engineered cells.

The route of introduction may be, for example, by oral feeding, by inhalation, or by subdermal, intramuscular, intravenous, or intraperitoneal injection, as appropriate to the desired result.

After the cell strain has been introduced into the test and control animals, expression of the gene encoding the biomolecular binder is regulated to allow production of the biomolecular binder in the engineered pathogen cells. This can be achieved, for instance, by administering to the test animals a treatment appropriate to the regulation system built into the cells, to cause the gene encoding the biomolecular binder to be expressed. The same treatment is not administered to the control animals, but the conditions under which they are maintained are otherwise identical to those of the test animals. The treatment to express the gene encoding the biomolecular binder can be the administration of an inducer substance (where expression of the biomolecular binder gene is under the control of an inducible promoter) or the functional removal of a repressor substance (where expression of the biomolecular binder gene is under the control of a repressible promoter).

After such treatment, the test and control animals can be monitored for a phenotypic effect in the introduced cells. Where the introduced cells are constructed pathogen cells, the animals can be monitored for signs of infection (as the simplest endpoint, death of the animal, but also, e.g., lethargy, lack of grooming behavior, hunched posture, not eating, diarrhea or other discharges; bacterial titer in samples of blood or other cultured fluids or tissues). In the case of testing engineered tumor cells, the test and control animals can be monitored for the development of tumors or for other indicators of the proliferation of the introduced engineered cells. If the test animals are observed to exhibit less growth of the introduced cells than the control animals, then the biomolecule can be also called a biomolecular inhibitor of growth, or biomolecular inhibitor of infection, as appropriate, as it can be concluded that the expression in vivo of the biomolecular inhibitor is the cause of the relative reduction in growth of the introduced cells in the test animals.

1.7.17. In Vitro Assays

Further steps of the procedure involve in vitro assays to identify one or more compounds that have binding and activating or inhibitory properties that are similar to those of the biomolecules which have been found to have a phenotypic effect, such as inhibition of growth. That is, compounds that compete for binding to a target cell component with the biomolecule would then be structural analogs of the biomolecules. Assays to identify such compounds can take advantage of known methods to identify competing molecules in a binding assay. These steps comprise general step (3) of the method.

In one method to identify such compounds, a biomolecular inhibitor (or activator) can be contacted with the isolated target cell component to allow binding, one or more compounds can be added to the milieu comprising the biomolecular inhibitor and the cell component under conditions that allow interaction and binding between the cell component and the biomolecular inhibitor, and any biomolecular inhibitor that is released from the cell component can be detected.

1.7.18. Fluorescence

One suitable system that allows the detection of released biomolecular inhibitor (or activator) is one in which fluorescence polarization of molecules in the milieu can be measured. The biomolecular inhibitor can have bound to it a fluorescent tag or label such as fluorescein or fluorescein attached to a linker.

Assays for inhibition of the binding of the biomolecular inhibitor to the cell component can be done in microtiter plates to conveniently test a set of compounds at the same time. In such assays, a majority of the fluorescently labeled biomolecular inhibitor must bind to the protein in the absence of competitor compound to allow for the detection of small changes in the bound versus free probe population when a compound which is a competitor with a biomolecular inhibitor is added (B. A. Lynch, et al., Analytical Biochemistry 247: 77-82 (1997)). If a compound competes with the biomolecular inhibitor for a binding site on the target cell component, then fluorescently labeled biomolecular inhibitor is released from the target cell component, lowering the polarization measured in the milieu.
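The drop in polarization upon displacement can be made concrete with a short calculation. The sketch below uses the standard definition of polarization from parallel and perpendicular emission intensities; the displaced fraction is estimated by linear interpolation between fully bound and fully free reference wells, which is an approximation that assumes the label's brightness does not change on binding. The function names and example readings are hypothetical.

```python
def polarization_mP(i_parallel, i_perpendicular):
    """Fluorescence polarization in millipolarization units (mP):
    P = (I_par - I_perp) / (I_par + I_perp), scaled by 1000."""
    return 1000.0 * (i_parallel - i_perpendicular) / (i_parallel + i_perpendicular)


def fraction_displaced(p_obs, p_bound, p_free):
    """Approximate fraction of labeled biomolecular inhibitor released from
    the target, by linear interpolation between reference wells.

    p_bound: polarization with target and no competitor (mostly bound probe)
    p_free:  polarization of the labeled probe alone
    Assumes equal fluorescence intensity for bound and free probe.
    """
    frac = (p_bound - p_obs) / (p_bound - p_free)
    return max(0.0, min(1.0, frac))  # clamp measurement noise into [0, 1]


if __name__ == "__main__":
    # Hypothetical plate-reader intensities for one well
    print(polarization_mP(600.0, 400.0))
    # Hypothetical reference values: 200 mP bound, 50 mP free, 125 mP with compound
    print(fraction_displaced(125.0, 200.0, 50.0))
```

A compound giving a displaced fraction well above that of vehicle-only wells would be a candidate competitor, subject to confirmation in a dose-response experiment.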

1.7.19. Radioactive Isotope

In a further method for identifying one or more compounds that compete with a biomolecular inhibitor (or activator) for a binding site on a target cell component, the target cell component can be attached to a solid support, contacted with one or more compounds, and contacted with the biomolecular inhibitor. One or more washing steps can be employed to remove biomolecular inhibitor and compound not bound to the cell component. Either the biomolecular inhibitor bound to the target cell component or the compound bound to the target cell component can be measured. Detection of biomolecular inhibitor or compound bound to the cell component can be facilitated by the use of a label on either molecule type, wherein the label can be, for instance, a radioactive isotope either incorporated into the molecule itself or attached as an adduct, streptavidin or biotin, a fluorescent label or a substrate for an enzyme that can produce from the substrate a colored or fluorescent product. An appropriate means of detection of the labeled biomolecular inhibitor or compound moiety of the biomolecular inhibitor-cell component complex or the compound-cell component complex can be applied. For example, a scintillation counter can be used to measure radioactivity. Radiolabeled streptavidin or biotin can be allowed to bind to biotin or streptavidin, respectively, and the resulting complexes detected in a scintillation counter. Alkaline phosphatase conjugated to streptavidin can be added to a biotin-labeled biomolecular inhibitor or compound. Detection and quantitation of a biotin-labeled complex can then be by addition of pNPP substrate of alkaline phosphatase and detection, by spectrophotometry, of a product which absorbs light at a wavelength of 405 nm. A fluorescent label can also be used, in which case detection of fluorescent complexes can be by a fluorometer. Models are available that can read multiple samples, as in a microtiter plate.
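The spectrophotometric readout at 405 nm can be converted to an amount of product with the Beer-Lambert law, A = ε·c·l. The following is a sketch under stated assumptions: the default molar absorptivity of 18,000 M⁻¹ cm⁻¹ for the p-nitrophenol product is a commonly used approximate value under alkaline conditions and should be replaced by a value calibrated for the actual buffer and instrument; the path length and function name are likewise illustrative.

```python
def product_concentration_molar(a405_sample, a405_blank,
                                epsilon_m_cm=18000.0, path_cm=1.0):
    """Convert a blank-corrected absorbance at 405 nm to molar product
    concentration using the Beer-Lambert law, c = A / (epsilon * l).

    epsilon_m_cm: assumed molar absorptivity (M^-1 cm^-1); calibrate in practice.
    path_cm:      optical path length in cm (microtiter wells are often shorter
                  than 1 cm, so this must be set for the plate used).
    """
    corrected = a405_sample - a405_blank
    return corrected / (epsilon_m_cm * path_cm)


if __name__ == "__main__":
    # Hypothetical readings: sample well 0.95, substrate-only blank 0.05
    conc = product_concentration_molar(0.95, 0.05)
    print(f"{conc * 1e6:.1f} micromolar product")
```

Because the signal is proportional to the amount of biotin-labeled complex retained on the solid support, comparing wells with and without a test compound indicates whether the compound reduced binding of the labeled molecule.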

For example, in one type of assay, the method for identifying compounds comprises attaching the target cell component to a solid support, contacting the biomolecular inhibitor with the target cell component under conditions suitable for binding of the biomolecular inhibitor to the cell component, removing unbound biomolecular inhibitor from the solid support, contacting one or more compounds (e.g., a mixture of compounds) with the cell component under conditions suitable for binding of the biomolecular inhibitor to the cell component, and testing for unbound biomolecular inhibitor released from the cell component, whereby if unbound biomolecular inhibitor is detected, one or more compounds that displace or compete with the biomolecular inhibitor for a particular site on the target cell component have been identified.

Other methods for identifying compounds that are competitive binders with the biomolecule for a target can employ adaptations of fluorescence polarization methods. See, for instance, Anal. Biochem. 253(2): 210-218 (1997), Anal. Biochem. 249(1): 29-36 (1997), BioTechniques 17(3): 585-589 (1994) and Nature 373: 254-256 (1995).

Those compounds that bind competitively to the target cell component can be considered to be drug candidates. Further appropriate testing can confirm that those compounds which bind competitively with biomolecular inhibitors (or activators) possess the same activity as seen in an intracellular test of the effect of the biomolecular inhibitor or activator upon the phenotype of cells. Derivatives of these compounds having modifications to confer improved solubility, stability, etc., can also be tested for a desired phenotypic effect.

1.7.20. Combining Steps

Combining steps for testing the phenotypic effects of a biomolecule, as can be produced in an intracellular test, with steps for identifying compounds that compete with the biomolecule for sites on a target cell component, yields a method for identifying a compound which is a functional analog of a biomolecule which produces a phenotypic effect on a cell. These steps can be to test, for the phenotypic effect, either in culture or in an animal model, or in both, a cell which produces a biomolecule by regulable expression of an exogenous gene in the cell, and to identify, if the biomolecule caused the phenotypic effect, one or more compounds that compete with the biomolecule for binding to a target cell component. If a compound is found to compete with the biomolecule for binding to the target cell component, then the compound is a functional analog of a biomolecule which produces a phenotypic effect on the cell. Such a functional analog can cause a qualitatively similar effect on the cell, to a similar, lesser or greater degree than the biomolecule.

1.7.21. Method for Determining Whether a Target Component of a Cell is Essential to Producing a Phenotypic Effect on the Cell

A further embodiment of the invention combining general steps (1) and (2) is a method for determining whether a target component of a cell is essential to producing a phenotypic effect on the cell, comprising isolating the target component from the cell, identifying a biomolecular binder of the isolated target component of the cell, constructing a second cell comprising the target component and a regulable, exogenous gene encoding the biomolecular binder, and testing the second cell in culture for an altered phenotypic effect, upon production of the biomolecular binder in the second cell, whereby, if the second cell shows the altered phenotypic effect upon production of the biomolecular binder, then the target component of the first cell is essential to producing the phenotypic effect on the first cell.

1.7.22. Inhibit the Proliferation of the Cells

The methods described herein are well suited to the identification of compounds that can inhibit the proliferation of the cells of infectious agents such as bacteria, fungi and the like. In addition, a procedure such as the one outlined below can be used in the identification of compounds to inhibit the proliferation of cancer cells. The two procedures described below further illustrate the use of the methods described herein and would provide proof of principle of these methods with a known target for anticancer therapy.

Mammalian dihydrofolate reductase (DHFR) is a proven target for anticancer therapy. Methotrexate (MTX) is one of many existing drugs that inhibit DHFR. It is widely used for anticancer chemotherapy.

NIH 3T3 is a mouse fibroblast cell line that is able to develop spontaneously transformed cells when cultured in a low concentration (2%) of calf serum in molecular, cellular and developmental biology medium 402 (MCDB 402) (M. Chow and H. Rubin, Proc. Natl. Acad. Sci. USA 95(8): 4550-4555 (1998)). The transformed cells, which can be selectively inhibited by MTX (Chow and Rubin), are isolated.

Both the normal and transformed NIH 3T3 cells are transfected with pTet-On plasmid (Clontech; Palo Alto, Calif.). Stable cell lines that express high levels of reverse tetracycline-controlled transactivator (rtTA) are isolated and characterized for their normal or transformed phenotype (Chow and Rubin).

The DHFR gene (Genbank Accession # L26316) from the NIH 3T3 cell line is amplified by reverse transcription-PCR (RT-PCR) using poly A⁺ RNA isolated from NIH 3T3 cells (Sambrook, J. et al., Molecular Cloning: A Laboratory Manual, 2nd edition, Cold Spring Harbor Laboratory Press, 1989). Active DHFR is expressed using the BacPAK Baculovirus Expression System (Clontech) or other appropriate systems. The expressed DHFR is purified and biotinylated and subjected to peptide binder identification as exemplified for bacterial proteins. The identified peptides are biochemically characterized for in vitro inhibition of DHFR activity. Peptides that inhibit DHFR are identified. A nucleic acid encoding each peptide can be cloned into a vector such as pGEX4T2 (Pharmacia) to yield a vector which encodes a fusion polypeptide having the peptide fused to the N-terminus of GST. This can also be done by PCR amplification as exemplified herein for the peptide Pro-3. The fusion genes are cloned into plasmid pTRE (Clontech) for regulated expression. The constructed plasmid or the vector is cotransfected with pTK-Hyg into the stable NIH 3T3 cell line that expresses rtTA. The resulting cell lines, termed 3T3N-VITA (normal 3T3 cells that express rtTA and the DHFR inhibitory peptides), 3T3T-VITA (transformed 3T3 cells that express rtTA and the DHFR inhibitory peptides), or 3T3T-VITA control (transformed 3T3 cells that express rtTA and GST), are characterized for their normal or transformed phenotype (loss of contact inhibition, change in morphology, immortalization, etc.). 10²-10¹ of 3T3T-VITA or 3T3T-VITA control cells are mixed with 10⁵ 3T3N-VITA cells and are grown in MCDB 402 medium with 10% calf serum at 37 °C for three days. Tetracycline is added to the medium to a final concentration of 0 to 1 μg/ml. In a control, 200 nM of MTX is added. The cultures are incubated for an additional eight days, and the number of foci formed is counted as described by M. Chow and H. Rubin, Proc. Natl. Acad. Sci. USA 95(8): 4550-4555 (1998). Peptides that specifically inhibit foci formation of 3T3 transformed cells are identified.

A murine model of fibroblastoma (Kogerman, P. et al., Oncogene (12): 1407-1416 (1997)) is used for evaluating the DHFR/peptide combination for identification of compounds for cancer therapy. Various amounts of 3T3T-VITA or 3T3T-VITA control cells (10³, 10⁴, 10⁵, 10⁶ cells) are injected subcutaneously into 5 groups (10 in each group) of athymic nude mice (4-6 weeks old, 18-22 g) to determine the minimal dose needed for development of fibroblastomas in all of the tested animals. Upon determination of the minimal tumorigenic dose, 6 groups of athymic nude mice (10 each) are injected subcutaneously (s.c.) with the minimal tumorigenic dose for 3T3T-VITA or 3T3T-VITA control cells to develop fibroblastoma. One week after injection, group 1 mice start receiving MTX s.c. at 2 mg/kg/day as positive control, groups 2 to 5 start receiving 1, 2, 5, or 10 mg/kg/day of tetracycline, and group 6 starts receiving saline (vehicle) as control. Five weeks after the introduction of cells, all of the mice are sacrificed and the tumors are removed. Tumor mass is measured and compared among the groups.

An effective peptide identified by these in vivo experiments can be used for screening libraries of compounds to identify those compounds that competitively bind to DHFR. One mechanism of tumorigenesis is overexpression of proto-oncogenes such as Ha-ras (Reviewed by Suarez, H. G., Anticancer Research 9(5): 1331-1343 (1989)).

Compounds that inhibit the activities of the products of such proto-oncogenes can be used for cancer chemotherapy. What follows is a further illustration of the methods described herein, as applied to mammalian cells.

Transgenic mice that overexpress human Ha-ras have been produced. Such transgenic mice develop salivary and/or mammary adenocarcinomas (Nielsen, L. L. et al., In Vivo 8(5): 1331-1343 (1994)). Secondary transgenic mice that express rtTA can be generated using the pTet-On plasmid from Clontech.

Human Ha-ras open reading frame cDNA (Genbank Accession #GO0277) is amplified by RT-PCR using polyA-RNA isolated from human mammary gland or other tissues. Active Ha-ras is expressed using the BacPAK Baculovirus Expression System (Clontech) or other appropriate systems. The expressed Ha-ras is purified and biotinylated and subjected to peptide binder identification as exemplified herein for bacterial proteins as target cell components. The identified peptides are biochemically characterized for in vitro inhibition of Ha-ras GTPase activity.

Nucleic acids encoding peptides that inhibit Ha-ras are cloned into plasmid pTRE (Clontech) for regulated expression as an N-terminal fusion of GST. Such constructs are used to generate tertiary transgenic mice using the secondary transgenic mice. Transgenic mice that are able to overexpress peptide genes are identified by Northern and Western analysis. Control mice that express GST are also identified.

Various doses of tetracycline are administered to the tertiary transgenic mice by s.c. or i.p. injection before or after tumor onset. Prevention or regression of tumors resulting from expression of the peptide genes is analyzed as described above for murine fibroblastoma.

Peptides found to be effective in the in vivo experiments will be used to screen for compounds that inhibit human Ha-ras activity for cancer therapy.

1.7.23. Disease Targets

The method of the invention can be applied more generally to mammalian diseases caused by: (1) loss or gain of protein function, or (2) over-expression or loss of regulation of protein activity. In each case the starting point is the identification of a putative protein target or metabolic pathway involved in the disease. The protocol can sometimes vary with the disease indication, depending on the availability of cell culture and animal model systems to study the disease. In all cases the process can deliver a validated target and assay combination to support the initiation of drug discovery.

Appropriate disease indications include, but are not limited to, Alzheimer's, arthritis, cancer, cardiovascular diseases, central nervous system disorders, diabetes, depression, hypertension, inflammation, obesity and pain.

Appropriate protein targets putatively linked to disease indications include, but are not limited to: (1) the leptin protein, putatively linked to obesity and diabetes; (2) a mitogen-activated protein kinase putatively linked to arthritis, osteoporosis and atherosclerosis; (3) the interleukin-1 beta converting protein putatively linked to arthritis, asthma and inflammation; (4) the caspase proteins putatively linked to neurodegenerative diseases such as Alzheimer's, Parkinson's and stroke; and (5) the tumor necrosis factor protein putatively linked to obesity and diabetes. Appropriate protein targets include also, but are not limited to, enzymes catalyzing the following types of reactions: (1) oxido-reductases, (2) transferases, (3) hydrolases, (4) lyases, (5) isomerases, and (6) ligases.

The arachidonic acid pathway constitutes one of the main mechanisms for the production of pain and inflammation. The pathway produces different classes of end products, including the prostaglandins, thromboxane and leukotrienes.

Prostaglandins, an end product of cyclooxygenase metabolism, modulate immune function, mediate vascular phases of inflammation and are potent vasodilators. The major therapeutic action of aspirin and other non-steroidal anti-inflammatory drugs (NSAIDs) is proposed to be inhibition of the enzyme cyclooxygenase (COX). Anti-inflammatory potencies of different NSAIDs have been shown to be proportional to their action as COX inhibitors. It has also been shown that COX inhibition produces toxic side effects such as erosive gastritis and renal toxicity. The knowledge base regarding the toxic side effects of COX inhibitors has been gained through years of monitoring human therapies and human suffering. Two kinds of COX enzymes are now known to exist, with inhibition of COX1 related to toxicity, and inhibition of COX2 related to reduction of inflammation. Thus, selective COX2 inhibition is a desirable characteristic of new anti-inflammatory drugs. The method of the invention can provide a route from identification of potential drug targets to validating these targets (for example, COX1 and COX2) as playing a role in disease (pain and inflammation) to an examination of the phenotype for the inhibition of one or both target isozymes without human suffering. Importantly, this information can be collected in vivo.

As an alternative strategy, the method of the invention can be used to define the phenotype of “genes of unknown function” obtained from various human genome sequencing projects or to assess the phenotype resulting from inhibition of one isozyme subtype or one member of a family of related protein targets.

1.5. Definitions

Target: (also, “target component of a cell,” or “target cell component”) a constituent of a cell which contributes to and is necessary for the production or maintenance of a phenotype of the cell in which it is found. A target can be a single type of molecule or can be a complex of molecules. A target can be the product of a single gene, but can also be a complex comprising more than one gene product (for example, an enzyme comprising alpha and beta subunits, mRNA, tRNA, ribosomal RNA or a ribonucleoprotein particle such as a snRNP). Targets can be the product of a characterized gene (gene of known function) or the product of an uncharacterized gene (gene of unknown function).

Target Validation: the process of determining whether a target is essential to the maintenance of a phenotype of the cell type in which the target normally occurs. For example, for pathogenic bacteria, researchers developing antimicrobials want to know if a compound which is potentially an antimicrobial agent not only binds to a target in vitro, but also binds to, and modulates the function of, a target in the bacteria in vivo, and especially under the conditions in which the bacteria are producing an infection, that is, those conditions under which the antimicrobial agent must work to inhibit bacterial growth in an infected animal or human. If such compounds can be found that bind to a target in vitro and alter the target's function in cells resulting in an altered phenotype, as found by testing cells in culture and/or as found by testing cells in an animal, then the target is validated.

Phenotypic Effect: a change in an observable characteristic of a cell which can include, e.g., growth rate, level or activity of an enzyme produced by the cell, sensitivity to various agents, antigenic characteristics, and level of various metabolites of the cell. A phenotypic effect can be a change away from wild type (normal) phenotype, or can be a change towards wild type phenotype, for example.

A phenotypic effect can be the causing or curing of a disease state, especially where mammalian cells are referred to herein. For cells of a pathogen or tumor cells, especially, a phenotypic effect can be the slowing of growth rate or cessation of growth.

Biomolecule: a molecule which can be produced as a gene product in cells that have been appropriately constructed to comprise one or more genes encoding the biomolecule. Preferably, production of the biomolecule can be turned on, when desired, by an inducible promoter. A biomolecule can be a peptide, polypeptide, or an RNA or RNA oligonucleotide, a DNA or DNA oligonucleotide, but is preferably a peptide. The same biomolecules can also be made synthetically. For peptides, see Merrifield, J., J. Am. Chem. Soc. 85: 2140-2154 (1963). For instance, an Applied Biosystems 431A Peptide Synthesizer (Perkin Elmer) can be used for peptide synthesis. Biomolecules produced as gene products intracellularly are tested for their interaction with a target in the intracellular steps described herein (tests performed with cells in culture and tests performed with cells that have been introduced into animals). The same biomolecules produced synthetically are tested for their binding to an isolated target in an initial in vitro method described herein.

Synthetically produced biomolecules can also be used for a final step of the method for finding compounds that are competitive binders of the target.

Biomolecular Binder (of a target): a biomolecule which has been tested for its ability to bind to an isolated target cell component in vitro and has been found to bind to the target.

Biomolecular Inhibitor of Growth: a biomolecule which has been tested for its ability to inhibit the growth of cells constructed to produce the biomolecule in an “in culture” test of the effect of the biomolecule on growth of the cells, and has been found, in fact, to inhibit the growth of the cells in this test in culture.

Biomolecular Inhibitor of Infection: a biomolecule which has been tested for its ability to ameliorate the effects of infection, and has been found to do so. In the test, pathogen cells constructed to regulably express the biomolecule are introduced into one or more animals, the gene encoding the biomolecule is regulated so as to allow production of the biomolecule in the cells, and the effects of production of the biomolecule are observed in the infected animals compared to one or more suitable control animals.

Isolated: term used herein to indicate that the material in question exists in a physical milieu distinct from that in which it occurs in nature. For example, an isolated target cell component of the invention may be substantially isolated with respect to the complex cellular milieu in which it naturally occurs. The absolute level of purity is not critical, and those skilled in the art can readily determine appropriate levels of purity according to the use to which the material is to be put.

In many circumstances the isolated material will form part of a composition (for example, a more or less crude extract containing other substances), buffer system or reagent mix. In other circumstances, the material may be purified to essential homogeneity, for example as determined by PAGE or column chromatography (for example, HPLC).

Pathogen or Pathogenic Organism: an organism which is capable of causing disease, detectable by signs of infection or symptoms characteristic of disease. Pathogens can include procaryotes (which include, for example, medically significant Gram-positive bacteria such as Streptococcus pneumoniae, Enterococcus faecalis and Staphylococcus aureus, Gram-negative bacteria such as Escherichia coli, Pseudomonas aeruginosa and Klebsiella pneumoniae, and “acid-fast” bacteria such as Mycobacteria, especially M. tuberculosis), eucaryotes such as yeast and fungi (for example, Candida albicans and Aspergillus fumigatus) and parasites. It should be recognized that pathogens can include such organisms as soil-dwelling organisms and “normal flora” of the skin, gut and orifices, if such organisms colonize and cause symptoms of infection in a human or other mammal, by abnormal proliferation or by growth at a site from which the organism cannot usually be cultured.

SECTION 2. WHOLE CELL ENGINEERING USING REAL-TIME METABOLIC FLUX ANALYSIS

Technical Field

In one embodiment, the present invention provides methods for whole cell engineering, cell biology and molecular biology. In particular, the invention is directed to methods for whole cell engineering of new and modified phenotypes by using “on-line” or “real-time” metabolic flux analysis.

Background

In one embodiment of this invention, whole cell metabolic flux analysis is a “horizontal” or “holistic” approach to study the metabolism, or “metabolome,” of an organism. A whole cell “horizontal” metabolome approach studies the expression and function of all of the genes of an organism simultaneously. By using this whole cell approach to study a cell's metabolism, it is possible to get a complete snapshot of the whole cell's transcriptome (the expressed transcripts, or mRNA messages) and proteome (the expressed polypeptides). However, such snapshots are static pictures of one aspect of a cell's physiology and metabolism. Development of a means to dynamically monitor many different parameters in a cell culture would be much more effective in detecting new or altered cell phenotypes.

Summary

One embodiment of this invention provides a method for whole cell engineering of new or modified phenotypes by using real-time metabolic flux analysis, the method comprising the following steps: (a) making a modified cell by modifying the genetic composition of a cell; (b) culturing the modified cell to generate a plurality of modified cells; (c) measuring at least one metabolic parameter of the cell by monitoring the cell culture of step (b) in real time; and, (d) analyzing the data of step (c) to determine if the measured parameter differs from a comparable measurement in an unmodified cell under similar conditions, thereby identifying an engineered phenotype in the cell using real-time metabolic flux analysis.
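Steps (c) and (d) above amount to comparing a monitored time series from the modified culture against a comparable series from an unmodified culture. The following Python sketch shows one simple way such a comparison could be made. It is an assumption-laden illustration, not the analysis of the invention: the function names, the use of a mean relative difference, and the 20% threshold are all hypothetical choices.

```python
from statistics import mean


def relative_differences(modified, control):
    """Per-time-point relative difference of a metabolic parameter.

    Both series must be sampled at the same time points under similar
    conditions; control readings of zero are skipped.
    """
    if len(modified) != len(control):
        raise ValueError("series must have the same number of time points")
    return [(m - c) / c for m, c in zip(modified, control) if c != 0]


def flags_engineered_phenotype(modified, control, threshold=0.2):
    """Return True if the mean relative difference reaches the threshold.

    threshold is a hypothetical cutoff (20% by default), not a value
    taken from the specification.
    """
    diffs = relative_differences(modified, control)
    return bool(diffs) and abs(mean(diffs)) >= threshold


# Example: a parameter sampled at three time points in each culture
control_series = [1.0, 2.0, 4.0]
modified_series = [1.5, 3.0, 6.0]
flagged = flags_engineered_phenotype(modified_series, control_series)
```

A real-time implementation would call such a check each time new readings arrive rather than once at the end of the culture.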

In one aspect, the genetic composition of the cell is modified by a method comprising addition of a nucleic acid to the cell. One or more nucleic acids can be added at the same time, or, in series. The genetic composition of the cell can be modified by addition of a nucleic acid heterologous to the cell, or, a nucleic acid homologous to the cell. The homologous nucleic acid can comprise a modified homologous nucleic acid, such as a modified homologous gene. The coding sequence or transcriptional regulatory sequence of a gene can be modified. Alternatively, the genetic composition of the cell can be modified by a method comprising deletion of a sequence or modification of a sequence in the cell. The genetic composition of the cell can be modified by a method comprising modifying or knocking out the expression of a gene.

The method can further comprise selecting a cell comprising a newly engineered phenotype. The selected cell can be isolated. The method can further comprise culturing the selected or isolated cell, thereby generating a new cell strain or cell line comprising a newly engineered phenotype. The methods can further comprise isolating a cell comprising a newly engineered phenotype.

Any phenotype can be added or modified. For example, a phenotype can be specifically targeted for change or addition. Thus, specific heterologous genes can be inserted or specific homologous genes can be stochastically or non-stochastically modified. For example, the newly engineered phenotype can be an increased or decreased expression or amount of a polypeptide, an increased or decreased amount of an mRNA transcript, an increased or decreased expression of a gene, an increased or decreased resistance or sensitivity to a toxin, an increased or decreased use or production of a metabolite, an increased or decreased uptake of a compound by the cell, an increased or decreased rate of metabolism, or an increased or decreased growth rate.

The newly engineered phenotype can be a stable phenotype. In another aspect, it can be a transient or an inducible phenotype. In one aspect, modifying the genetic composition of a cell comprises insertion of a construct into the cell, wherein the construct comprises a nucleic acid operably linked to a constitutively active promoter. Alternatively, modifying the genetic composition of a cell can comprise insertion of a construct into the cell, wherein the construct comprises a nucleic acid operably linked to an inducible promoter. The nucleic acid added to the cell can be stably inserted into the genome of the cell. Alternatively, the nucleic acid added to the cell can propagate as an episome in the cell.

In one aspect, the nucleic acid added to the cell can encode a peptide or a polypeptide. The polypeptide can comprise a homologous polypeptide, such as a modified homologous polypeptide. Alternatively, the polypeptide can comprise a heterologous polypeptide. The nucleic acid added to the cell can encode a transcript comprising a sequence that is antisense to a homologous transcript. In one aspect, modifying the genetic composition of the cell can comprise increasing or decreasing the expression of an mRNA transcript. Modifying the genetic composition of the cell can comprise increasing or decreasing the expression of a polypeptide, a lipid, a mono- or poly-saccharide or a nucleic acid.

In one aspect, modifying the homologous gene can comprise knocking out expression of the homologous gene. Modifying the homologous gene can comprise increasing the expression of the homologous gene. The gene modification can be random (i.e., stochastic), or non-random or targeted (i.e., non-stochastic).

In an exemplary non-stochastic gene modification, a gene to be inserted into a cell to modify a phenotype can be a heterologous gene or a sequence-modified homologous gene, wherein the sequence modification is made by a method comprising the following steps: (a) providing a template polynucleotide, wherein the template polynucleotide comprises a homologous gene of the cell (it can also be a heterologous gene that you wish to modify); (b) providing a plurality of oligonucleotides, wherein each oligonucleotide comprises a sequence homologous to the template polynucleotide, thereby targeting a specific sequence of the template polynucleotide, and a sequence that is a variant of the homologous gene; (c) generating progeny polynucleotides comprising non-stochastic sequence variations by replicating the template polynucleotide of step (a) with the oligonucleotides of step (b), thereby generating polynucleotides comprising homologous gene sequence variations. One variation of this method has been termed “gene site-saturation mutagenesis,” “site-saturation mutagenesis,” “saturation mutagenesis” or simply “GSSM,” and is described in further detail, below. It can be used in combination with other mutagenization processes. See, e.g., U.S. Pat. Nos. 6,171,820; 6,238,884.
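The oligonucleotides of step (b) above each pair a template-homologous sequence with a variant sequence. As a purely illustrative Python sketch, one way to enumerate such oligonucleotides for a single codon position is shown below. The function name, the flank length, the toy template, and the choice to enumerate every alternative codon explicitly (rather than using degenerate bases) are assumptions made for illustration and are not taken from the specification.

```python
from itertools import product


def saturation_oligos(template, codon_index, flank=9):
    """Enumerate oligos that replace one codon of the template with every
    other possible codon, keeping template-homologous flanks on each side.

    template    : coding sequence (A/C/G/T), read in frame from position 0
    codon_index : zero-based index of the codon to vary
    flank       : number of homologous bases kept on each side (assumed value)
    """
    start = codon_index * 3
    if start < 0 or start + 3 > len(template):
        raise ValueError("codon_index is outside the template")
    left = template[max(0, start - flank):start]
    right = template[start + 3:start + 3 + flank]
    original = template[start:start + 3]
    oligos = []
    for bases in product("ACGT", repeat=3):
        codon = "".join(bases)
        if codon != original:  # skip the unchanged (wild-type) codon
            oligos.append(left + codon + right)
    return oligos


# Example: vary the third codon (AAA) of a short, made-up template
toy_template = "ATGGCTAAAGGTCTGTAA"
oligos = saturation_oligos(toy_template, codon_index=2, flank=6)
```

Repeating the call for each codon position gives a systematic set of single-position variants of the template.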

Another exemplary non-stochastic gene modification process comprises introduction of two or more related polynucleotides into a suitable host cell such that a hybrid polynucleotide is generated by recombination and reductive reassortment. For example, the sequence modification of the gene to be modified (e.g., the heterologous gene or homologous gene) is made by a method comprising the following steps: (a) providing a template polynucleotide, wherein the template polynucleotide comprises sequence encoding a homologous gene; (b) providing a plurality of building block polynucleotides, wherein the building block polynucleotides are designed to cross-over reassemble with the template polynucleotide at a predetermined sequence, and a building block polynucleotide comprises a sequence that is a variant of the homologous gene and a sequence homologous to the template polynucleotide flanking the variant sequence; (c) combining a building block polynucleotide with a template polynucleotide such that the building block polynucleotide cross-over reassembles with the template polynucleotide to generate polynucleotides comprising homologous gene sequence variations. One variation of this method has been termed “synthetic ligation reassembly,” or simply “SLR,” and is described in further detail, below. It can be used in combination with other mutagenization processes. See, e.g., U.S. Pat. No. 6,171,820.

Any cell can be engineered by the methods of the invention, including, e.g., prokaryotic cells and eukaryotic cells. Bacteria, Archaebacteria, fungi, yeast, plant cells, insect cells, mammalian cells, including human cells, without limitation, can be engineered by the methods of the invention. Furthermore, intracellular parasites, bacteria and viruses can be “indirectly” engineered by culturing and monitoring of eukaryotic cells by the methods of the invention, including, e.g., immunodeficiency viruses, e.g., HIV, oncoviruses, mycobacteria, protozoan organisms (e.g., trypanosomes, such as Trypanosoma rangeli), Plasmodium (e.g., Plasmodium falciparum), Toxoplasma (e.g., Toxoplasma gondii), Leishmania, and the like.

In practicing the methods of the invention, any metabolic parameter can be measured. In one aspect, several different metabolic parameters are evaluated in the cell culture. The metabolic parameters can be measured at the same time or sequentially. One exemplary metabolic parameter is rate of cell growth, which can be measured by, e.g., a change in optical density of the cell culture. Another exemplary metabolic parameter measured comprises a change in the expression of a polypeptide. Changes in the expression of the polypeptide can be measured by any method, e.g., a one-dimensional gel electrophoresis, a two-dimensional gel electrophoresis, a tandem mass spectrometry, an RIA, an ELISA, an immunoprecipitation and a Western blot.
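As an illustration of how a rate of cell growth might be derived from optical density readings, the following Python sketch fits a straight line to the natural logarithm of optical density against time. It assumes that the readings come from the exponential growth phase and that optical density is proportional to cell number over the measured range; the function names and example readings are hypothetical.

```python
import math


def specific_growth_rate(times_h, od_values):
    """Least-squares slope of ln(OD) versus time, in units of 1/hour.

    Assumes all readings fall within exponential growth and are positive.
    """
    if len(times_h) != len(od_values) or len(times_h) < 2:
        raise ValueError("need at least two paired readings")
    logs = [math.log(od) for od in od_values]
    n = len(times_h)
    t_mean = sum(times_h) / n
    l_mean = sum(logs) / n
    sxx = sum((t - t_mean) ** 2 for t in times_h)
    if sxx == 0:
        raise ValueError("time points must not all be identical")
    sxy = sum((t - t_mean) * (l - l_mean) for t, l in zip(times_h, logs))
    return sxy / sxx


def doubling_time(mu):
    """Doubling time in hours for a positive specific growth rate mu."""
    return math.log(2) / mu


# Example: a culture whose optical density doubles every hour
mu = specific_growth_rate([0.0, 1.0, 2.0, 3.0], [0.1, 0.2, 0.4, 0.8])
```

Comparing the growth rate of a modified culture with that of an unmodified culture grown under similar conditions gives one of the metabolic parameters discussed above.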

In one aspect, the measured metabolic parameter comprises a change in expression of at least one transcript, or, the expression of a transcript of a newly introduced gene. The change in expression of the transcript can be measured by a method selected from the group consisting of a hybridization, a quantitative amplification and a Northern blot. The transcript expression can be measured by hybridization of a sample comprising transcripts of a cell or nucleic acid representative of or complementary to transcripts of a cell by hybridization to immobilized nucleic acids on an array.

In one aspect, the measured metabolic parameter comprises a measurement of a metabolite, including primary and secondary metabolites. For example, the measured metabolic parameter can comprise an increase or a decrease in a primary or a secondary metabolite. The secondary metabolite can be selected from the group consisting of a glycerol and a methanol. The measured metabolic parameter can comprise an increase or a decrease in an organic acid, such as an acetate, a butyrate, a succinate and an oxaloacetate.

In one aspect, the measured metabolic parameter comprises an increase or a decrease in intracellular pH, or, extracellular pH in a culture medium. The increase or decrease in intracellular pH can be measured by intracellular application of a dye; the change in fluorescence of the dye can be measured over time. In one aspect, the measured metabolic parameter comprises gas exchange rate measurements.

In one aspect, the measured metabolic parameter comprises an increase or a decrease in synthesis of DNA or RNA over time. The increase or decrease in synthesis, or accumulation, or decay, of DNA or RNA over time can be measured by intracellular application of a dye; the change in fluorescence of the dye can be measured over time.

In one aspect, the measured metabolic parameter comprises an increase or a decrease in uptake of a composition. The composition can be a metabolite, such as a monosaccharide, a disaccharide, a polysaccharide, a lipid, a nucleic acid, an amino acid and a polypeptide. The monosaccharide, disaccharide or polysaccharide can comprise a glucose or a sucrose. The composition can also be an antibiotic, a metal, a steroid and an antibody.

In one aspect, the measured metabolic parameter comprises an increase or a decrease in the secretion of a byproduct or a secreted composition of a cell. The byproduct or secreted composition can be a toxin, a lymphokine, a polysaccharide, a lipid, a nucleic acid, an amino acid, a polypeptide or an antibody.

In one aspect of the methods, the real time monitoring simultaneously measures a plurality of metabolic parameters. The real time monitoring of a plurality of metabolic parameters can comprise use of a Cell Growth Monitor device. The Cell Growth Monitor device can be a Wedgewood Technology, Inc., Cell Growth Monitor model 652, or similar model or variation thereof. In one aspect, the real time simultaneous monitoring measures uptake of substrates, levels of intracellular organic acids and levels of intracellular amino acids. The real time simultaneous monitoring can measure: uptake of glucose; levels of acetate, butyrate, succinate or oxaloacetate; and levels of intracellular natural amino acids.

In one aspect, the method further comprises use of a computer-implemented program to monitor, in real time, the change in measured metabolic parameters over time. The computer-implemented program can comprise a computer-implemented method as set forth in FIG. 28. The computer-implemented method can comprise metabolic network equations. This computer-implemented method can also comprise a pathway analysis, an error analysis, such as a weighted least squares solution, and a flux estimation. The computer-implemented method can further comprise a preprocessing unit to filter out measurement errors before the metabolic flux analysis.

The details of one or more aspects of the invention are set forth in the accompanying drawings and the description below. Other features, objects, and advantages of the invention will be apparent from the description and drawings, and from the claims.

All publications, GenBank Accession references (sequences), ATCC Deposits, patents and patent applications cited herein are hereby expressly incorporated by reference for all purposes.

Brief Description Of The Drawings

FIG. 28 is a schematic illustrating an exemplary metabolic flux analysis (MFA) procedure of the invention.

Detailed Description

In one embodiment, the invention provides novel methods for whole cell engineering of new and modified phenotypes by using “on-line” or “real-time” metabolic flux analysis. In practicing the methods of the invention, as a first step, a cell is modified by changing the genetic composition of the cell. The modification can be made by random, i.e., stochastic, methods or by non-stochastic methods, as described herein. Specific genes or specific metabolic pathways can be targeted for modification.

In one aspect, the second step of the methods of the invention comprises culturing the modified cell to generate a plurality of modified cells. The cells can be cultured by any means, for example, in cell culture, such as a tissue culture, by fermentation or tissue culture reactors, or in a cell growth monitor device.

In one aspect, the next step of the methods comprises measuring at least one metabolic parameter of the cell in real time. In one aspect, a plurality of metabolome parameters are simultaneously measured. Thus, one or several devices can be used to monitor and measure metabolic parameters. For example, a cell growth monitor device can measure a plurality of metabolic parameters of the cells in culture in real time. One example is the Wedgewood Technology, Inc. (San Carlos, Calif.), Cell Growth Monitor model 652™, as discussed below.

Finally, in one embodiment, the methods comprise analyzing these data to determine whether the measured parameters differ from a comparable measurement in an unmodified cell under similar conditions, or change over time, thereby identifying an engineered phenotype in the cell using real-time metabolic flux analysis. For example, the parameter can be higher, lower or change at a rate that differs from a wild type cell or cell culture. It is not necessary to simultaneously monitor an unmodified cell or cell culture in real time to determine if and/or what phenotypic modifications result from the modification of the cell's genetic composition. Data and information already known can be used as a reference.

In one aspect of the invention, the methods further comprise use of a computer-implemented program to monitor, in real time, the change in measured metabolic parameters over time and to analyze and display the resulting processed data. One exemplary computer-implemented program comprises a computer-implemented method as set forth in FIG. 1. In this and other computer-implemented methods that can be used, the paradigm comprises use of metabolic network equations, metabolic pathway analyses, error analysis, such as a weighted least squares solution to give a flux estimation, and the like.
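As a rough illustration of a weighted least squares flux estimate, the following sketch fits unknown fluxes v to redundant measurements r under a linear model r ≈ A·v, weighting each measurement by the inverse of its variance. The two-flux network, the measurement values, the standard deviations and the function name are all hypothetical and are not taken from the figures; a practical implementation would also handle the full stoichiometric network and the error filtering step.

```python
def weighted_least_squares(A, r, sigma):
    """Estimate fluxes v minimizing sum(((A v - r)_i / sigma_i) ** 2).

    A     : list of rows relating each measurement to the unknown fluxes
    r     : measured rates
    sigma : assumed standard deviation of each measurement
    """
    m, n = len(A), len(A[0])
    w = [1.0 / s ** 2 for s in sigma]  # weights = inverse variances
    # Normal equations: (A^T W A) v = A^T W r
    N = [[sum(w[k] * A[k][i] * A[k][j] for k in range(m)) for j in range(n)]
         for i in range(n)]
    b = [sum(w[k] * A[k][i] * r[k] for k in range(m)) for i in range(n)]
    # Gaussian elimination with partial pivoting on the augmented matrix.
    M = [row + [bi] for row, bi in zip(N, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda rr: abs(M[rr][col]))
        if abs(M[pivot][col]) < 1e-12:
            raise ValueError("fluxes are not identifiable from these measurements")
        M[col], M[pivot] = M[pivot], M[col]
        for rr in range(col + 1, n):
            f = M[rr][col] / M[col][col]
            for c in range(col, n + 1):
                M[rr][c] -= f * M[col][c]
    # Back substitution.
    v = [0.0] * n
    for i in reversed(range(n)):
        v[i] = (M[i][n] - sum(M[i][j] * v[j] for j in range(i + 1, n))) / M[i][i]
    return v


# Hypothetical toy network: v1 = glucose uptake, v2 = acetate secretion,
# plus one redundant measurement of their sum.
A = [[1, 0], [0, 1], [1, 1]]
measured = [10.2, 3.9, 14.0]   # illustrative values only
sigma = [0.5, 0.2, 1.0]        # illustrative measurement uncertainties
print(weighted_least_squares(A, measured, sigma))
```

Because the third measurement is redundant, the estimate reconciles all three readings, leaning most heavily on the measurement with the smallest assumed uncertainty.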

In one aspect of the invention, a nucleic acid (or, the nucleic acid) responsible for the altered phenotype is identified, re-isolated, again modified (e.g., either stochastically or non-stochastically), reinserted into the cell, and the process of real-time metabolic flux analysis is iteratively repeated. The process can be iteratively repeated until a desired phenotype is engineered. For example, a plant cell or plant cell culture is subjected to iterative repetition of the methods of the invention until a new plant cell is made that comprises a desired new phenotype, e.g., enhanced growth, nutritional value or insect or drought resistance, or all or some of these characteristics. A pathogenic microorganism can be subjected to iterative repetition of the methods of the invention until it becomes non-pathogenic. A microorganism can be engineered to become lethal to another organism, such as an insect, or to produce a variety of antibiotics or other compositions. Microorganisms can be subjected to iterative repetition of the methods of the invention to engineer, e.g., increased yield of desired products, removal of unwanted co-metabolites, improved utilization of inexpensive carbon and nitrogen sources, adaptation to fermentor/bioreactor growth conditions, increased production of a primary metabolite, increased production of a secondary metabolite, increased tolerance to acidic conditions, increased tolerance to basic conditions, increased tolerance to organic solvents, increased tolerance to high salt conditions and increased tolerance to high or low temperatures.

A complete biosynthetic pathway can be inserted into a cell. Any cell phenotype can be modified or any phenotype can be added to a cell using the methods of the invention, without limitation. The invention can be practiced in combination with other methods for inserting and screening for metabolic pathways, see, e.g., U.S. Pat. No. 6,268,140, which describes producing and screening combinatorial metabolic libraries of multimeric proteins, or U.S. Pat. No. 5,712,146, which describes vectors encoding polyketide synthases which in turn catalyze the production of a variety of polyketides.

Definitions

Unless defined otherwise, all technical and scientific terms used herein have the meaning commonly understood by a person skilled in the art to which this invention belongs. As used herein, the following terms have the meanings ascribed to them unless specified otherwise.

The terms “array” or “microarray” or “biochip” or “chip” as used herein refer to a plurality of target elements, each target element comprising a defined amount of one or more polypeptides or nucleic acids immobilized onto a defined area of a substrate surface, as discussed in further detail, below.

As used herein, the terms “computer” and “processor” are used in their broadest general contexts and incorporate all such devices, as described in detail, below.

The term “saturation mutagenesis” or “GSSM” includes a method that uses degenerate oligonucleotide primers to introduce point mutations into a polynucleotide, as described in detail, below.

The term “optimized directed evolution system” or “optimized directed evolution” includes a method for reassembling fragments of related nucleic acid sequences, e.g., related genes, as explained in detail, below.

The term “synthetic ligation reassembly” or “SLR” includes a method of ligating oligonucleotide fragments in a non-stochastic fashion, as explained in detail, below.

The term “antibody” includes a peptide or polypeptide derived from, modeled after or substantially encoded by an immunoglobulin gene or immunoglobulin genes, or fragments thereof, capable of specifically binding an antigen or epitope, see, e.g., Fundamental Immunology, Third Edition, W. E. Paul, ed., Raven Press, N.Y. (1993); Wilson (1994) J. Immunol. Methods 175: 267-73; Yarmush (1992) J. Biochem. Biophys. Methods 25: 85-97. The term antibody includes antigen-binding portions, i.e., “antigen binding sites,” (e.g., fragments, subsequences, complementarity determining regions (CDRs)) that retain capacity to bind antigen, including (i) a Fab fragment, a monovalent fragment consisting of the VL, VH, CL and CH1 domains; (ii) a F(ab′)2 fragment, a bivalent fragment comprising two Fab fragments linked by a disulfide bridge at the hinge region; (iii) a Fd fragment consisting of the VH and CH1 domains; (iv) a Fv fragment consisting of the VL and VH domains of a single arm of an antibody; (v) a dAb fragment (Ward et al., (1989) Nature 341: 544-546), which consists of a VH domain; and (vi) an isolated complementarity determining region (CDR). Single chain antibodies are also included by reference in the term “antibody.”

Generating and Manipulating Nucleic Acids

The methods of the invention include modifying the genetic composition of a cell by addition of a heterologous nucleic acid into the cell or modification of a homologous gene in the cell. Nucleic acids can be isolated from a cell, recombinantly generated or made synthetically. The sequences can be isolated by, e.g., cloning and expression of cDNA libraries, amplification of message or genomic DNA by PCR, and the like. In practicing the methods of the invention, homologous genes can be modified by manipulating a template nucleic acid, as described herein. The invention can be practiced in conjunction with any method or protocol or device known in the art, which are well described in the scientific and patent literature.

General Techniques

The nucleic acids used to practice this invention, whether RNA, cDNA, genomic DNA, vectors, viruses or hybrids thereof, may be isolated from a variety of sources, genetically engineered, amplified, and/or expressed/generated recombinantly. Recombinant polypeptides generated from these nucleic acids can be individually isolated or cloned and tested for a desired activity. Any recombinant expression system can be used, including bacterial, mammalian, yeast, insect or plant cell expression systems.

Alternatively, these nucleic acids can be synthesized in vitro by well-known chemical synthesis techniques, as described in, e.g., Adams (1983) J. Am. Chem. Soc. 105: 661; Belousov (1997) Nucleic Acids Res. 25: 3440-3444; Frenkel (1995) Free Radic. Biol. Med. 19: 373-380; Blommers (1994) Biochemistry 33: 7886-7896; Narang (1979) Meth. Enzymol. 68: 90; Brown (1979) Meth. Enzymol. 68: 109; Beaucage (1981) Tetra. Lett. 22: 1859; U.S. Pat. No. 4,458,066.

Techniques for the manipulation of nucleic acids, such as, e.g., subcloning, labeling probes (e.g., random-primer labeling using Klenow polymerase, nick translation, amplification), sequencing, hybridization and the like are well described in the scientific and patent literature, see, e.g., Sambrook, ed., MOLECULAR CLONING: A LABORATORY MANUAL (2ND ED.), Vols. 1-3, Cold Spring Harbor Laboratory, (1989); CURRENT PROTOCOLS IN MOLECULAR BIOLOGY, Ausubel, ed. John Wiley & Sons, Inc., New York (1997); LABORATORY TECHNIQUES IN BIOCHEMISTRY AND MOLECULAR BIOLOGY: HYBRIDIZATION WITH NUCLEIC ACID PROBES, Part I. Theory and Nucleic Acid Preparation, Tijssen, ed. Elsevier, N.Y. (1993).

Nucleic acids, vectors, capsids, polypeptides, and the like can be analyzed and quantified by any of a number of general means well known to those of skill in the art. These include, e.g., analytical biochemical methods such as NMR, spectrophotometry, radiography, electrophoresis, capillary electrophoresis, high performance liquid chromatography (HPLC), thin layer chromatography (TLC), and hyperdiffusion chromatography; various immunological methods, e.g., fluid or gel precipitin reactions, immunodiffusion, immuno-electrophoresis, radioimmunoassays (RIAs), enzyme-linked immunosorbent assays (ELISAs) and immuno-fluorescent assays; Southern analysis, Northern analysis, dot-blot analysis, gel electrophoresis (e.g., SDS-PAGE), nucleic acid or target or signal amplification methods, radiolabeling, scintillation counting, and affinity chromatography.

Another useful means of obtaining and manipulating nucleic acids used to practice the methods of the invention is to clone from genomic samples, and, if desired, screen and re-clone inserts isolated or amplified from, e.g., genomic clones or cDNA clones. Sources of nucleic acid used in the methods of the invention include genomic or cDNA libraries contained in, e.g., mammalian artificial chromosomes (MACs), see, e.g., U.S. Pat. Nos. 5,721,118; 6,025,155; human artificial chromosomes, see, e.g., Rosenfeld (1997) Nat. Genet. 15: 333-335; yeast artificial chromosomes (YAC); bacterial artificial chromosomes (BAC); P1 artificial chromosomes, see, e.g., Woon (1998) Genomics 50: 306-316; P1-derived vectors (PACs), see, e.g., Kern (1997) Biotechniques 23: 120-124; cosmids, recombinant viruses, phages or plasmids.

Amplification of Nucleic Acids

In practicing the methods of the invention, heterologous, homologous or modified nucleic acids can be reproduced by, e.g., amplification. Amplification reactions can also be used to quantify the amount of nucleic acid in a sample (such as the amount of message in a cell sample), label the nucleic acid (e.g., to apply it to an array or a blot), detect the nucleic acid, or quantify the amount of a specific nucleic acid in a sample. In one aspect of the invention, message isolated from a cell or a cDNA library is amplified. The skilled artisan can select and design suitable oligonucleotide amplification primers. Amplification methods are also well known in the art, and include, e.g., polymerase chain reaction, PCR (see, e.g., PCR PROTOCOLS, A GUIDE TO METHODS AND APPLICATIONS, ed. Innis, Academic Press, N.Y. (1990) and PCR STRATEGIES (1995), ed. Innis, Academic Press, Inc., N.Y.); ligase chain reaction (LCR) (see, e.g., Wu (1989) Genomics 4: 560; Landegren (1988) Science 241: 1077; Barringer (1990) Gene 89: 117); transcription amplification (see, e.g., Kwoh (1989) Proc. Natl. Acad. Sci. USA 86: 1173); self-sustained sequence replication (see, e.g., Guatelli (1990) Proc. Natl. Acad. Sci. USA 87: 1874); Q Beta replicase amplification (see, e.g., Smith (1997) J. Clin. Microbiol. 35: 1477-1491); automated Q-beta replicase amplification assay (see, e.g., Burg (1996) Mol. Cell. Probes 10: 257-271); and other RNA polymerase mediated techniques (e.g., NASBA, Cangene, Mississauga, Ontario); see also Berger (1987) Methods Enzymol. 152: 307-316; Sambrook; Ausubel; U.S. Pat. Nos. 4,683,195 and 4,683,202; Sooknanan (1995) Biotechnology 13: 563-564.

Modification of Nucleic Acids

In practicing the methods of the invention, the genetic composition of a cell is altered by, e.g., modification of a homologous gene ex vivo, followed by its reinsertion into the cell. A homologous gene, a heterologous gene, or a gene selected by the methods of the invention can be altered by any means, including, e.g., random or stochastic methods, or non-stochastic, or “directed evolution,” methods.

Methods for random mutation of genes are well known in the art, see, e.g., U.S. Pat. No. 5,830,696. For example, mutagens can be used to randomly mutate a gene. Mutagens include, e.g., ultraviolet light or gamma irradiation, or a chemical mutagen, e.g., mitomycin, nitrous acid, photoactivated psoralens, alone or in combination, to induce DNA breaks amenable to repair by recombination. Other chemical mutagens include, for example, sodium bisulfite, nitrous acid, hydroxylamine, hydrazine or formic acid. Other mutagens are analogues of nucleotide precursors, e.g., nitrosoguanidine, 5-bromouracil, 2-aminopurine, or acridine. These agents can be added to a PCR reaction in place of the nucleotide precursor, thereby mutating the sequence. Intercalating agents such as proflavine, acriflavine, quinacrine and the like can also be used.

Techniques in molecular biology can be used, e.g., random PCR mutagenesis, see, e.g., Rice (1992) Proc. Natl. Acad. Sci. USA 89: 5467-5471; or combinatorial multiple cassette mutagenesis, see, e.g., Crameri (1995) Biotechniques 18: 194-196. Alternatively, nucleic acids, e.g., genes, can be reassembled after random, or “stochastic,” fragmentation, see, e.g., U.S. Pat. Nos. 6,291,242; 6,287,862; 6,287,861; 5,955,358; 5,830,721; 5,824,514; 5,811,238; 5,605,793.

Non-stochastic, or “directed evolution,” methods include, e.g., saturation mutagenesis (GSSM), synthetic ligation reassembly (SLR), or a combination thereof. In one aspect of the invention, nucleic acids are selected, using real-time metabolic flux analysis, for conferring a new or modified phenotype on a cell, isolated, modified and reinserted into a cell to reiterate the steps of the methods of the invention. Polypeptides encoded by isolated and/or modified nucleic acids can be screened for an activity before their reinsertion into the cell by, e.g., using a capillary array platform. See, e.g., U.S. Pat. Nos. 6,280,926; 5,939,250.

Saturation mutagenesis, or, GSSM

In one aspect of the invention, non-stochastic gene modification, a “directed evolution process,” can be used to modify a gene to be inserted into a cell to add or modify a phenotype. Variations of this method have been termed “gene site-saturation mutagenesis,” “site-saturation mutagenesis,” “saturation mutagenesis” or simply “GSSM.” It can be used in combination with other mutagenization processes. See, e.g., U.S. Pat. Nos. 6,171,820; 6,238,884. In one aspect, GSSM comprises providing a template polynucleotide and a plurality of oligonucleotides, wherein each oligonucleotide comprises a sequence homologous to the template polynucleotide, thereby targeting a specific sequence of the template polynucleotide, and a sequence that is a variant of the homologous gene; generating progeny polynucleotides comprising non-stochastic sequence variations by replicating the template polynucleotide with the oligonucleotides, thereby generating polynucleotides comprising homologous gene sequence variations.

In one aspect, codon primers containing a degenerate N,N,G/T sequence are used to introduce point mutations into a polynucleotide, so as to generate a set of progeny polypeptides in which a full range of single amino acid substitutions is represented at each amino acid position, e.g., an amino acid residue in an enzyme active site or ligand binding site targeted to be modified. These oligonucleotides can comprise a contiguous first homologous sequence, a degenerate N,N,G/T sequence, and, optionally, a second homologous sequence. The downstream progeny translational products from the use of such oligonucleotides include all possible amino acid changes at each amino acid site along the polypeptide, because the degeneracy of the N,N,G/T sequence includes codons for all 20 amino acids.

In one aspect, one such degenerate oligonucleotide (comprised of, e.g., one degenerate N,N,G/T cassette) is used for subjecting each original codon in a parental polynucleotide template to a full range of codon substitutions. In another aspect, at least two degenerate cassettes are used, either in the same oligonucleotide or not, for subjecting at least two original codons in a parental polynucleotide template to a full range of codon substitutions. For example, more than one N,N,G/T sequence can be contained in one oligonucleotide to introduce amino acid mutations at more than one site. This plurality of N,N,G/T sequences can be directly contiguous, or separated by one or more additional nucleotide sequence(s). In another aspect, oligonucleotides serviceable for introducing additions and deletions can be used either alone or in combination with the codons containing an N,N,G/T sequence, to introduce any combination or permutation of amino acid additions, deletions, and/or substitutions.

In one aspect, simultaneous mutagenesis of two or more contiguous amino acid positions is done using an oligonucleotide that contains contiguous N,N,G/T triplets, i.e., a degenerate (N,N,G/T)n sequence. In another aspect, degenerate cassettes having less degeneracy than the N,N,G/T sequence are used. For example, it may be desirable in some instances to use (e.g., in an oligonucleotide) a degenerate triplet sequence comprised of only one N, where said N can be in the first, second or third position of the triplet. Any other bases, including any combinations and permutations thereof, can be used in the remaining two positions of the triplet. Alternatively, it may be desirable in some instances to use (e.g., in an oligo) a degenerate N,N,N triplet sequence.

In one aspect, use of degenerate triplets (e.g., N,N,G/T triplets) allows for systematic and easy generation of a full range of possible natural amino acids (for a total of 20 amino acids) at each and every amino acid position in a polypeptide (in alternative aspects, the methods also include generation of fewer than all possible substitutions per amino acid residue, or codon, position). For example, for a 100 amino acid polypeptide, 2000 distinct species (i.e., 20 possible amino acids per position × 100 amino acid positions) can be generated. Through the use of an oligonucleotide or set of oligonucleotides containing a degenerate N,N,G/T triplet, 32 individual sequences can code for all 20 possible natural amino acids. Thus, in a reaction vessel in which a parental polynucleotide sequence is subjected to saturation mutagenesis using at least one such oligonucleotide, there are generated 32 distinct progeny polynucleotides encoding 20 distinct polypeptides. In contrast, the use of a non-degenerate oligonucleotide in site-directed mutagenesis leads to only one progeny polypeptide product per reaction vessel. Nondegenerate oligonucleotides can optionally be used in combination with the degenerate primers disclosed; for example, nondegenerate oligonucleotides can be used to generate specific point mutations in a working polynucleotide. This provides one means to generate specific silent point mutations, point mutations leading to corresponding amino acid changes, and point mutations that cause the generation of stop codons and the corresponding expression of polypeptide fragments.
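The codon arithmetic in the preceding paragraph can be checked with a short sketch. It assumes the standard genetic code and treats the G/T third position as a choice between G and T; the table layout and function name are illustrative only and are not part of the disclosure.

```python
from itertools import product

BASES = "TCAG"
# Standard genetic code, listed in TCAG order for each codon position;
# '*' marks a stop codon.
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    "".join(codon): aa
    for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)
}


def nng_t_codons():
    """Enumerate every codon matched by the degenerate triplet N,N,G/T."""
    return ["".join(c) for c in product(BASES, BASES, "GT")]


codons = nng_t_codons()
encoded = {CODON_TABLE[c] for c in codons}
amino_acids = encoded - {"*"}

print(len(codons))       # 32 distinct codons
print(len(amino_acids))  # 20 distinct amino acids
print(100 * 20)          # 2000 single-site species for a 100-residue polypeptide
```

Under the standard code, the 32 codons cover all 20 amino acids and also include one stop codon (TAG), which is worth keeping in mind when interpreting truncated products in a screen.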

In one aspect, each saturation mutagenesis reaction vessel contains polynucleotides encoding at least 20 progeny polypeptide molecules such that all 20 natural amino acids are represented at the one specific amino acid position corresponding to the codon position mutagenized in the parental polynucleotide (other aspects use fewer than all 20 natural combinations). The 32-fold degenerate progeny polypeptides generated from each saturation mutagenesis reaction vessel can be subjected to clonal amplification (e.g., cloned into a suitable host, e.g., E. coli host, using, e.g., an expression vector) and subjected to expression screening. When an individual progeny polypeptide is identified by screening to display a favorable change in property (when compared to the parental polypeptide, such as increased affinity or avidity to an antigen), it can be sequenced to identify the correspondingly favorable amino acid substitution contained therein.

In one aspect, upon mutagenizing each and every amino acid position in a parental polypeptide using saturation mutagenesis as disclosed herein, favorable amino acid changes may be identified at more than one amino acid position. One or more new progeny molecules can be generated that contain a combination of all or part of these favorable amino acid substitutions. For example, if 2 specific favorable amino acid changes are identified in each of 3 amino acid positions in a polypeptide, the permutations include 3 possibilities at each position (no change from the original amino acid, and each of two favorable changes) and 3 positions. Thus, there are 3×3×3 or 27 total possibilities, including 7 that were previously examined: 6 single point mutations (i.e., 2 at each of three positions) and no change at any position.
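The count in this example can be reproduced by enumerating the combinations directly. In the sketch below the residues chosen for each position are hypothetical placeholders; only the counts (27 in total, 7 already examined) come from the text.

```python
from itertools import product

# Hypothetical residues: the first entry at each position is the original
# amino acid, followed by the two favorable substitutions found by screening.
options = [["A", "G", "S"], ["L", "I", "V"], ["K", "R", "H"]]

original = tuple(choices[0] for choices in options)
variants = list(product(*options))


def n_changes(variant):
    """Number of positions at which a variant differs from the original."""
    return sum(a != b for a, b in zip(variant, original))


already_examined = [v for v in variants if n_changes(v) <= 1]
new_combinations = [v for v in variants if n_changes(v) >= 2]

print(len(variants))          # 27 total possibilities
print(len(already_examined))  # 7: the original plus 6 single point mutations
print(len(new_combinations))  # 20 combinations not previously examined
```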

In another aspect, site-saturation mutagenesis can be used together with another stochastic or non-stochastic means to vary sequence, e.g., synthetic ligation reassembly (see below), shuffling, chimerization, recombination and other mutagenizing processes and mutagenizing agents. This invention provides for the use of any mutagenizing process(es), including saturation mutagenesis, in an iterative manner.

Synthetic Ligation Reassembly (SLR)

Another non-stochastic gene modification, a “directed evolution process,” that can be used in the methods of the invention to modify a gene to be inserted into a cell to add or modify a phenotype has been termed “synthetic ligation reassembly,” or simply “SLR.” SLR is a method of ligating oligonucleotide fragments together non-stochastically. This method differs from stochastic oligonucleotide shuffling in that the nucleic acid building blocks are not shuffled, concatenated or chimerized randomly, but rather are assembled non-stochastically. See, e.g., U.S. patent application Ser. No. 09/332,835 entitled “Synthetic Ligation Reassembly in Directed Evolution” and filed on Jun. 14, 1999 (“U.S. Ser. No. 09/332,835”). In one aspect, SLR comprises the following steps: (a) providing a template polynucleotide, wherein the template polynucleotide comprises sequence encoding a homologous gene; (b) providing a plurality of building block polynucleotides, wherein the building block polynucleotides are designed to cross-over reassemble with the template polynucleotide at a predetermined sequence, and a building block polynucleotide comprises a sequence that is a variant of the homologous gene and a sequence homologous to the template polynucleotide flanking the variant sequence; (c) combining a building block polynucleotide with a template polynucleotide such that the building block polynucleotide cross-over reassembles with the template polynucleotide to generate polynucleotides comprising homologous gene sequence variations.

SLR does not depend on the presence of high levels of homology between polynucleotides to be rearranged. Thus, this method can be used to non-stochastically generate libraries (or sets) of progeny molecules comprised of over 10¹⁰⁰ different chimeras. SLR can be used to generate libraries comprised of over 10¹⁰⁰⁰ different progeny chimeras. Thus, aspects of the present invention include non-stochastic methods of producing a set of finalized chimeric nucleic acid molecules having an overall assembly order that is chosen by design. This method includes the steps of generating by design a plurality of specific nucleic acid building blocks having serviceable mutually compatible ligatable ends, and assembling these nucleic acid building blocks, such that a designed overall assembly order is achieved.

The mutually compatible ligatable ends of the nucleic acid building blocks to be assembled are considered to be “serviceable” for this type of ordered assembly if they enable the building blocks to be coupled in predetermined orders. Thus the overall assembly order in which the nucleic acid building blocks can be coupled is specified by the design of the ligatable ends. If more than one assembly step is to be used, then the overall assembly order in which the nucleic acid building blocks can be coupled is also specified by the sequential order of the assembly step(s). In one aspect, the annealed building pieces are treated with an enzyme, such as a ligase (e.g., T4 DNA ligase), to achieve covalent bonding of the building pieces.

In one aspect, the design of the oligonucleotide building blocks is obtained by analyzing a set of progenitor nucleic acid sequence templates that serve as a basis for producing a progeny set of finalized chimeric polynucleotide molecules. These parental oligonucleotide templates thus serve as a source of sequence information that aids in the design of the nucleic acid building blocks that are to be mutagenized, e.g., chimerized or shuffled.

In one aspect of this method, the sequences of a plurality of parental nucleic acid templates are aligned in order to select one or more demarcation points. The demarcation points can be located at an area of homology, and are comprised of one or more nucleotides. These demarcation points are preferably shared by at least two of the progenitor templates. The demarcation points can thereby be used to delineate the boundaries of oligonucleotide building blocks to be generated in order to rearrange the parental polynucleotides. The demarcation points identified and selected in the progenitor molecules serve as potential chimerization points in the assembly of the final chimeric progeny molecules. A demarcation point can be an area of homology (comprised of at least one homologous nucleotide base) shared by at least two parental polynucleotide sequences. Alternatively, a demarcation point can be an area of homology that is shared by at least half of the parental polynucleotide sequences, or it can be an area of homology that is shared by at least two thirds of the parental polynucleotide sequences. Even more preferably, a serviceable demarcation point is an area of homology that is shared by at least three fourths of the parental polynucleotide sequences, or it can be shared by almost all of the parental polynucleotide sequences. In one aspect, a demarcation point is an area of homology that is shared by all of the parental polynucleotide sequences.
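One simple way to picture the selection of candidate demarcation points is to scan the columns of an existing alignment and keep those where enough parents share the same nucleotide. The sketch below assumes the parental sequences are already aligned to equal length with "-" for gaps; the sequences, the function name and the `min_fraction` threshold parameter are hypothetical, and a real design would also weigh block size and the compatibility of the ligatable ends.

```python
from collections import Counter


def demarcation_candidates(aligned, min_fraction=1.0):
    """Return (column, base) pairs where the most common base is shared by
    at least two parents and by at least min_fraction of all parents."""
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("sequences must be pre-aligned to equal length")
    hits = []
    for i, column in enumerate(zip(*aligned)):
        counts = Counter(base for base in column if base != "-")
        if not counts:
            continue  # column contains only gaps
        base, n = counts.most_common(1)[0]
        if n >= 2 and n / len(aligned) >= min_fraction:
            hits.append((i, base))
    return hits


# Hypothetical aligned parental sequences.
parents = ["ATGCCA", "ATGTCA", "ACGCTA"]

# Columns shared by all parents.
print(demarcation_candidates(parents))
# Columns shared by at least half of the parents.
print(demarcation_candidates(parents, min_fraction=0.5))
```

Lowering `min_fraction` mirrors the progression in the text from points shared by all parents down to points shared by only some of them.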

In one aspect, a ligation reassembly process is performed exhaustively in order to generate an exhaustive library of progeny chimeric polynucleotides. In other words, all possible ordered combinations of the nucleic acid building blocks are represented in the set of finalized chimeric nucleic acid molecules. At the same time, in another embodiment, the assembly order (i.e., the order of assembly of each building block in the 5′ to 3′ sequence of each finalized chimeric nucleic acid) in each combination is by design (or non-stochastic) as described above. Because of the non-stochastic nature of this invention, the possibility of unwanted side products is greatly reduced.
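An exhaustive, ordered library of this kind corresponds to taking one building block per position in a fixed 5′ to 3′ order, so the library size is the product of the number of alternatives at each position. The following sketch uses made-up block sequences and a hypothetical function name, and it ignores the design of the ligatable ends.

```python
from itertools import product


def exhaustive_library(blocks_by_position):
    """Build every chimera by choosing one block per position, in 5' to 3' order."""
    return ["".join(combo) for combo in product(*blocks_by_position)]


# Hypothetical alternatives for three consecutive positions.
slots = [["ATG", "ATA"], ["CCC", "CCT", "CCA"], ["TAA"]]
library = exhaustive_library(slots)

print(len(library))  # 2 * 3 * 1 = 6 ordered combinations
print(library[0])    # ATGCCCTAA
```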

In another aspect, the ligation reassembly method is performed systematically. For example, the method is performed in order to generate a systematically compartmentalized library of progeny molecules, with compartments that can be screened systematically, e.g., one by one. In other words, this invention provides that, through the selective and judicious use of specific nucleic acid building blocks, coupled with the selective and judicious use of sequentially stepped assembly reactions, a design can be achieved where specific sets of progeny products are made in each of several reaction vessels. This allows a systematic examination and screening procedure to be performed. Thus, these methods allow a potentially very large number of progeny molecules to be examined systematically in smaller groups.

Because of its ability to perform chimerizations in a manner that is highly flexible yet exhaustive and systematic as well, particularly when there is a low level of homology among the progenitor molecules, these methods provide for the generation of a library (or set) comprised of a large number of progeny molecules. Because of the non-stochastic nature of the instant ligation reassembly invention, the progeny molecules generated preferably comprise a library of finalized chimeric nucleic acid molecules having an overall assembly order that is chosen by design.

The saturation mutagenesis and optimized directed evolution methods also can be used to generate these amounts of different progeny molecular species.

It is appreciated that the invention provides freedom of choice and control regarding the selection of demarcation points, the size and number of the nucleic acid building blocks, and the size and design of the couplings. It is appreciated, furthermore, that the requirement for intermolecular homology is highly relaxed for the operability of this invention. In fact, demarcation points can even be chosen in areas of little or no intermolecular homology. For example, because of codon wobble, i.e. the degeneracy of codons, nucleotide substitutions can be introduced into nucleic acid building blocks without altering the amino acid originally encoded in the corresponding progenitor template. Alternatively, a codon can be altered such that the coding for an original amino acid is altered. This invention provides that such substitutions can be introduced into the nucleic acid building block in order to increase the incidence of intermolecularly homologous demarcation points and thus to allow an increased number of couplings to be achieved among the building blocks, which in turn allows a greater number of progeny chimeric molecules to be generated.
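
The use of codon degeneracy to raise homology at a demarcation point can be illustrated with a minimal sketch. It assumes the standard genetic code, in-frame DNA sequences of equal length, and hypothetical names (`silent_match`, `identity`, and the two example sequences); it is not the procedure of the specification, only one way to show that synonymous substitutions can increase nucleotide identity without changing the encoded amino acids.

```python
BASES = "TCAG"
# Standard genetic code, codons ordered TTT, TTC, TTA, TTG, TCT, ... GGG.
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(seq):
    """Translate an in-frame DNA string into one-letter amino acids."""
    return "".join(CODON_TABLE[seq[i:i + 3]] for i in range(0, len(seq), 3))


def silent_match(block, template):
    """Adopt the template's codon wherever it encodes the same amino acid."""
    out = []
    for pos in range(0, len(block), 3):
        own, other = block[pos:pos + 3], template[pos:pos + 3]
        same_residue = CODON_TABLE[own] == CODON_TABLE[other]
        out.append(other if same_residue else own)
    return "".join(out)


def identity(a, b):
    """Fraction of positions at which two equal-length sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)


block = "CTGAAAGCA"     # hypothetical building block encoding Leu-Lys-Ala
template = "TTAGATGCT"  # hypothetical second progenitor encoding Leu-Asp-Ala
harmonized = silent_match(block, template)
print(harmonized, translate(harmonized) == translate(block))
print(identity(block, template), identity(harmonized, template))
```

In this example the leucine and alanine codons are rewritten to match the template, while the lysine codon is left alone because the template encodes a different residue at that position.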

In another aspect, the synthetic nature of the step in which the building blocks are generated allows the design and introduction of nucleotides (e.g., one or more nucleotides, which may be, for example, codons or introns or regulatory sequences) that can later be optionally removed in an in vitro process (e.g. by mutagenesis) or in an in vivo process (e.g. by utilizing the gene splicing ability of a host organism). It is appreciated that in many instances the introduction of these nucleotides may also be desirable for many other reasons in addition to the potential benefit of creating a serviceable demarcation point.

Thus, according to another aspect, a nucleic acid building block can be used to introduce an intron. Functional introns may thereby be introduced into a man-made gene manufactured according to the methods described herein. The artificially introduced intron(s) can be functional in a host cell for gene splicing much in the way that naturally-occurring introns serve functionally in gene splicing.

Optimized Directed Evolution System

In practicing the methods of the invention, nucleic acids can also be modified by a method comprising an optimized directed evolution system. Optimized directed evolution is directed to the use of repeated cycles of reductive reassortment, recombination and selection that allow for the directed molecular evolution of nucleic acids through recombination. Optimized directed evolution allows generation of a large population of evolved chimeric sequences, wherein the generated population is significantly enriched for sequences that have a predetermined number of crossover events.

A crossover event is a point in a chimeric sequence where a shift in sequence occurs from one parental variant to another parental variant. Such a point is normally at the juncture where oligonucleotides from two parents are ligated together to form a single sequence. This method allows calculation of the correct concentrations of oligonucleotide sequences so that the final chimeric population of sequences is enriched for the chosen number of crossover events. This provides more control over choosing chimeric variants having a predetermined number of crossover events.

In addition, this method provides a convenient means for exploring a tremendous amount of the possible protein variant space in comparison to other systems. Previously, if one generated, for example, 10¹³ chimeric molecules during a reaction, it would be extremely difficult to test such a high number of chimeric variants for a particular activity. Moreover, a significant portion of the progeny population would have a very high number of crossover events, which resulted in proteins that were less likely to have increased levels of a particular activity. By using these methods, the population of chimeric molecules can be enriched for those variants that have a particular number of crossover events. Thus, although one can still generate 10¹³ chimeric molecules during a reaction, each of the molecules chosen for further analysis most likely has, for example, only three crossover events. Because the resulting progeny population can be skewed to have a predetermined number of crossover events, the boundaries on the functional variety between the chimeric molecules are reduced. This provides a more manageable number of variables when calculating which oligonucleotide from the original parental polynucleotides might be responsible for affecting a particular trait.

One method for creating a chimeric progeny polynucleotide sequence is to create oligonucleotides corresponding to fragments or portions of each parental sequence. Each oligonucleotide preferably includes a unique region of overlap so that mixing the oligonucleotides together results in a new variant that has each oligonucleotide fragment assembled in the correct order. Additional information can also be found in U.S. Ser. No. 09/332,835. The number of oligonucleotides generated for each parental variant bears a relationship to the total number of resulting crossovers in the chimeric molecule that is ultimately created. For example, three parental nucleotide sequence variants might be provided to undergo a ligation reaction in order to find a chimeric variant having, for example, greater activity at high temperature. As one example, a set of 50 oligonucleotide sequences can be generated corresponding to each portion of each parental variant. Accordingly, during the ligation reassembly process there could be up to 50 crossover events within each of the chimeric sequences. The probability that each of the generated chimeric polynucleotides will contain oligonucleotides from each parental variant in alternating order is very low. If each oligonucleotide fragment is present in the ligation reaction in the same molar quantity, it is likely that in some positions oligonucleotides from the same parental polynucleotide will ligate next to one another and thus not result in a crossover event. If the concentration of each oligonucleotide from each parent is kept constant during any ligation step in this example, there is a ⅓ chance (assuming 3 parents) that an oligonucleotide from the same parental variant will ligate within the chimeric sequence and produce no crossover.

Accordingly, a probability density function (PDF) can be determined to predict the population of crossover events that are likely to occur during each step in a ligation reaction given a set number of parental variants, a number of oligonucleotides corresponding to each variant, and the concentrations of each variant during each step in the ligation reaction. The statistics and mathematics behind determining the PDF are described below. By utilizing these methods, one can calculate such a probability density function, and thus enrich the chimeric progeny population for a predetermined number of crossover events resulting from a particular ligation reaction. Moreover, a target number of crossover events can be predetermined, and the system then programmed to calculate the starting quantities of each parental oligonucleotide during each step in the ligation reaction to result in a probability density function that centers on the predetermined number of crossover events.
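
A minimal sketch of such a calculation follows. It rests on a simplifying assumption that is not stated in the specification: at each fragment position the parent of origin is drawn independently, with probability equal to that parent's mole fraction among the oligonucleotides offered at that position. The function name `crossover_pdf` and the example mole fractions are hypothetical.

```python
def crossover_pdf(fractions):
    """Return pdf where pdf[k] is the probability of exactly k crossovers.

    fractions[i][p] is the mole fraction of parent p's oligonucleotide at
    fragment position i (each row is assumed to sum to 1).
    """
    n_parents = len(fractions[0])
    # state[p][k]: probability that the latest fragment came from parent p
    # and that k crossovers have occurred so far.
    state = [[f] for f in fractions[0]]
    for row in fractions[1:]:
        width = len(state[0]) + 1
        new_state = [[0.0] * width for _ in range(n_parents)]
        for prev in range(n_parents):
            for k, prob in enumerate(state[prev]):
                for nxt in range(n_parents):
                    step = 0 if nxt == prev else 1  # parent change = crossover
                    new_state[nxt][k + step] += prob * row[nxt]
        state = new_state
    n_counts = len(state[0])
    return [sum(state[p][k] for p in range(n_parents)) for k in range(n_counts)]


def mean_crossovers(pdf):
    return sum(k * p for k, p in enumerate(pdf))


# Three parents, 50 fragment positions, equimolar oligonucleotides.
equal = [[1 / 3, 1 / 3, 1 / 3] for _ in range(50)]
# Same design, but one parent supplied in ninefold excess over the others combined.
biased = [[0.9, 0.05, 0.05] for _ in range(50)]

print(round(mean_crossovers(crossover_pdf(equal)), 2))
print(round(mean_crossovers(crossover_pdf(biased)), 2))
```

Under this model, an equimolar mix gives a ⅓ chance of no crossover at each junction, consistent with the three-parent example above, and skewing the mix toward one parent shifts the distribution toward fewer crossovers. Note that 50 fragments are joined at 49 junctions in this sketch, so the count returned ranges from 0 to 49.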

Determining Crossover Events

Embodiments of the invention include a system and software that receive a desired crossover probability density function (PDF), the number of parent genes to be reassembled, and the number of fragments in the reassembly as inputs. The output of this program is a “fragment PDF” that can be used to determine a recipe for producing reassembled genes, and the estimated crossover PDF of those genes. The processing described herein is preferably performed in MATLAB® (The Mathworks, Natick, Mass.), a programming language and development environment for technical computing.

Iterative Processes

In practicing the methods of the invention, the process can be iteratively repeated. For example, a nucleic acid (or, the nucleic acid) responsible for an altered phenotype is identified, re-isolated, again modified, reinserted into the cell, and the process of real-time metabolic flux analysis is iteratively repeated. The process can be iteratively repeated until a desired phenotype is engineered. For example, an entire biochemical pathway can be engineered into a cell. Any cell phenotype can be modified or any phenotype can be added to a cell using the methods of the invention, without limitation.

Nucleic acids can be modified using either stochastic or non-stochastic methods. In various aspects, the methods generate sets of chimeric nucleic acid and protein molecules, followed by insertion into a cell, culturing, and then screening by using real-time metabolic flux analysis for a particular activity, such as a changed or added desired phenotype. The invention is not limited to only a single round of screening. Based on the results of a first round of screening, a second round of reassembly can take place that enriches for progeny having a desired property or incurring a desired phenotype.

Similarly, if it is determined that a particular oligonucleotide has no effect at all on the desired trait (e.g., a new phenotype), it can be removed as a variable by synthesizing larger parental oligonucleotides that include the sequence to be removed. Since incorporating the sequence within a larger sequence prevents any crossover events, there will no longer be any variation of this sequence in the progeny polynucleotides. This iterative practice of determining which oligonucleotides are most related to the desired trait, and which are unrelated, allows more efficient exploration of all of the possible protein variants that might provide a particular trait or activity.

Automated Control of Reactions

The process of generating any of the reactions of the methods of the invention can be automated with the assistance of automated devices and robotic instruments. For example, in one aspect, a cell growth monitor device is used for real-time metabolic flux analysis, such as a Wedgewood Technology, Inc., Cell Growth Monitor model 652. As noted below, this device can be linked to a computer system. Another exemplary device is a TECAN GENESIS™ programmable robot made by Tecan Corporation (Hombrechtikon, Switzerland), which can be interfaced with a computer that determines the quantities of each oligonucleotide fragment to yield a resulting PDF. By linking a computer system that determines the proper quantities of each oligonucleotide to an automated robot, a complete ligation reassembly system is produced. Data links through serial or other interfaces will allow the data files generated from the ligation reassembly calculations to be forwarded in the proper format for the robotic system to automatically begin allocating the proper quantities of each oligonucleotide fragment into a reaction tube.

The automated system can include a plurality of oligonucleotide fragments derived from a series of nucleic acid sequence variants, wherein said fragments are configured to join one another at unique overhangs. The system also has a data input field configured to store a target number of crossover events for each of the variant sequences. Within the system is also a prediction module configured to determine the quantity of each of the fragments to admix together so that mixing the fragments results in a population of progeny molecules that are enriched for crossover events corresponding to the target number. The system also provides a robotic arm linked to the prediction module through a communication interface for automatically mixing the fragments in the determined quantities.
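
What a prediction module of this kind might compute can be sketched as follows. This is not the algorithm of the specification. It assumes equimolar stock solutions, the same mixing ratio at every fragment position, independent choice of parent at each position, and a single "backbone" parent supplied in excess with the remaining parents sharing the rest equally. All names (`expected_crossovers`, `backbone_fraction`, `pipetting_plan`) and the 10 µl reaction volume are hypothetical. The sketch matches the mean of the crossover distribution to the target.

```python
def expected_crossovers(q, n_parents, n_fragments):
    """Mean crossover count when the backbone parent has mole fraction q."""
    rest = (1.0 - q) / (n_parents - 1)
    same_parent = q * q + (n_parents - 1) * rest * rest  # no crossover
    return (n_fragments - 1) * (1.0 - same_parent)


def backbone_fraction(target, n_parents, n_fragments, tol=1e-9):
    """Bisect for the backbone mole fraction giving the target mean."""
    lo, hi = 1.0 / n_parents, 1.0  # equimolar mix ... backbone parent only
    if not 0.0 <= target <= expected_crossovers(lo, n_parents, n_fragments):
        raise ValueError("target is not reachable under this model")
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if expected_crossovers(mid, n_parents, n_fragments) > target:
            lo = mid  # still too many crossovers: increase the bias
        else:
            hi = mid
    return (lo + hi) / 2


def pipetting_plan(target, n_parents, n_fragments, total_volume_ul=10.0):
    """Volumes per parent (backbone first) for one fragment position."""
    q = backbone_fraction(target, n_parents, n_fragments)
    rest = (1.0 - q) / (n_parents - 1)
    return [total_volume_ul * q] + [total_volume_ul * rest] * (n_parents - 1)


plan = pipetting_plan(target=3, n_parents=3, n_fragments=50)
print([round(v, 3) for v in plan])
```

A list of volumes of this kind is the sort of data file that could be forwarded to a liquid-handling robot; the file format and the robot interface are outside the scope of this sketch.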

Mutagenized Oligonucleotides

While the optimized directed evolution method can use oligonucleotides that have 100% fidelity to their parent polynucleotide sequence, this level of fidelity is not required. For example, if a set of three related parental polynucleotides is chosen to undergo ligation reassembly in order to create, e.g., a new phenotype, a set of oligonucleotides having unique overlapping regions can be synthesized by conventional methods. However, a set of mutagenized oligonucleotides could also be synthesized. These mutagenized oligonucleotides are preferably designed to encode silent, conservative, or non-conservative amino acid substitutions.

The choice to introduce a silent mutation might be made to, for example, add a region of nucleotide homology between two fragments without affecting the final translated protein. A non-conservative or conservative substitution is made to determine how such a change alters the function of the resultant polypeptide. This can be done if, for example, it is determined that mutations in one particular oligonucleotide fragment were responsible for increasing the activity of a peptide. By synthesizing mutagenized oligonucleotides (i.e., those having a different nucleotide sequence than their parent), one can explore, in a controlled manner, how resulting modifications to the peptide or protein sequence affect the activity of the peptide or polypeptide.

Another method for creating variants of a nucleic acid sequence using mutagenized fragments includes first aligning a plurality of nucleic acid sequences to determine demarcation sites within the variants that are conserved in a majority of said variants, but not conserved in all of said variants. A set of first sequence fragments of the conserved nucleic acid sequences is then generated, wherein the fragments bind to one another at the demarcation sites. A second set of fragments of the non-conserved nucleic acid sequences is then generated by, for example, a nucleic acid synthesizer. However, the non-conserved sequences are generated to have mutations at their demarcation sites so that the second fragments have the same nucleotide sequence at the demarcation sites as said first fragments. This allows the non-conserved sequences to still hybridize during the ligation reaction to the other parental sequences. Once the fragments are generated, a desired number of crossover events can be selected for each of the variants. The quantity of each of the first and second fragments is then calculated so that a ligation/incubation reaction between the calculated quantities of the first and second fragments will result in progeny molecules having the desired number of crossover events.
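
The first step of this method, locating candidate demarcation sites that are conserved in a majority of the aligned variants but not in all of them, can be sketched as a sliding-window scan. The sketch assumes gap-free aligned sequences of equal length and a fixed window width; the function name `majority_demarcation_sites`, the width of four nucleotides, and the example sequences are hypothetical.

```python
from collections import Counter


def majority_demarcation_sites(aligned, width=4):
    """Return (start, consensus, outliers) for majority-conserved windows.

    A window qualifies when one sequence string is shared by more than half
    of the variants but not by all of them. `outliers` lists the indices of
    the variants that would need mutations to match the consensus.
    """
    n = len(aligned)
    sites = []
    for start in range(len(aligned[0]) - width + 1):
        windows = [seq[start:start + width] for seq in aligned]
        consensus, count = Counter(windows).most_common(1)[0]
        if n / 2 < count < n:
            outliers = [i for i, w in enumerate(windows) if w != consensus]
            sites.append((start, consensus, outliers))
    return sites


aligned = ["ACGTTGCA", "ACGTTGCA", "ACGATGCA"]  # hypothetical variants
sites = majority_demarcation_sites(aligned)
for site in sites:
    print(site)
```

For each reported site, the fragments of the outlier variants would be synthesized with the consensus sequence at the demarcation site so that they can still join the fragments of the other variants.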

In Silico, or Computer, Models

In silico, or computer program-implemented, paradigms can be used in practicing the methods of the invention to design altered or new nucleic acids to modify cells for the creation of new phenotypes. One exemplary in silico method that can be used in practicing the methods of the invention for generating man-made polynucleotide sequences for the creation of new phenotypes detects shared domains between a plurality of template polynucleotides. It does so by aligning the template polynucleotides and identifying all sequence strings having a certain percentage of homology, e.g., about 75% to 95% sequence identity, that are shared between all of the template polynucleotides. This detects shared domains between the template polynucleotides. Next, domain sequences are switched from one template polynucleotide with the sequence of a corresponding domain. This is repeated until all domains have been switched with a corresponding domain on another template polynucleotide, thereby generating in silico a library of man-made polynucleotide sequences from a set of template polynucleotides.
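
The domain-switching step can be sketched as follows, assuming that the shared domains have already been identified and that each template is represented as a list of its domains in corresponding order. The function name `swap_domains` and the toy domain sequences are hypothetical, and the sketch performs one switch per progeny sequence.

```python
def swap_domains(templates):
    """Return progeny made by switching one domain at a time.

    templates[t][d] is the sequence of domain d in template t; domain d of
    one template corresponds to domain d of every other template.
    """
    library = set()
    for t, domains in enumerate(templates):
        for d in range(len(domains)):
            for other, donor in enumerate(templates):
                if other == t:
                    continue  # a template cannot donate to itself
                chimera = domains[:d] + [donor[d]] + domains[d + 1:]
                library.add("".join(chimera))
    return sorted(library)


templates = [
    ["AAA", "CCC", "GGG"],  # hypothetical template 1, three domains
    ["AAT", "CCT", "GGT"],  # hypothetical template 2, corresponding domains
]
library = swap_domains(templates)
print(len(library), library)
```

Repeating the switch on the progeny themselves would extend the library toward every combination of corresponding domains.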

In silico, or computer program-implemented, methods can also be used in practicing the methods of the invention to analyze metabolic flux data; see, e.g., Covert (2001) Trends Biochem. Sci. 26(3): 179-186; Jamshidi (2001) Bioinformatics 17(3): 286-287. For example, the quantitative relationship between a primary carbon source (e.g., for bacteria, acetate or succinate) uptake rate, oxygen uptake rate, and maximal cellular growth rate can be modeled in silico, and used as a complement to the “real-time” or “on-line” monitoring of the invention; see, e.g., Edwards (2001) Nat. Biotechnol. 19(2): 125-130. The effects of gene deletions in a central metabolic pathway can also be modeled in silico, and used as a complement to the “real-time” or “on-line” monitoring of the invention; see, e.g., Edwards (2000) Proc. Natl. Acad. Sci. USA 97(10): 5528-5533.

Measuring Metabolic Parameters

The methods of the invention involve whole cell evolution, or whole cell engineering, of a cell to develop a new cell strain having a new phenotype. To detect the new phenotype, at least one metabolic parameter of a modified cell is monitored in the cell in a “real time” or “on-line” time frame. In one aspect, a plurality of cells, such as a cell culture, is monitored in “real time” or “on-line.” In one aspect, a plurality of metabolic parameters is monitored in “real time” or “on-line.”

Metabolic flux analysis (MFA) is based on a known biochemistry framework. A linearly independent metabolic matrix is constructed based on the law of mass conservation and on the pseudo-steady state hypothesis (PSSH) on the intracellular metabolites. In practicing the methods of the invention, metabolic networks are established, including the:

-   identity of all pathway substrates, products and intermediary metabolites,
-   identity of all the chemical reactions interconverting the pathway metabolites, the stoichiometry of the pathway reactions,
-   identity of all the enzymes catalysing the reactions, the enzyme reaction kinetics,
-   the regulatory interactions between pathway components, e.g. allosteric interactions, enzyme-enzyme interactions etc.,
-   intracellular compartmentalisation of enzymes or any other supramolecular organisation of the enzymes, and
-   the presence of any concentration gradients of metabolites, enzymes or effector molecules or diffusion barriers to their movement.

Once the metabolic network for a given strain is built, a mathematical representation in matrix notation can be introduced to estimate the intracellular metabolic fluxes if the on-line metabolome data are available.
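
A minimal sketch of the matrix calculation is given below for a toy network. Under the pseudo-steady state hypothesis, the stoichiometric matrix S and the flux vector v satisfy S·v = 0 for the intracellular metabolites; splitting v into measured and unknown fluxes gives a linear system for the unknowns. The network, the measured values, and the function names (`solve_square`, `estimate_fluxes`) are hypothetical, and the sketch handles only the case in which the number of balanced metabolites equals the number of unknown fluxes.

```python
from fractions import Fraction


def solve_square(matrix, rhs):
    """Solve a square linear system exactly by Gauss-Jordan elimination."""
    n = len(matrix)
    aug = [[Fraction(x) for x in row] + [Fraction(b)]
           for row, b in zip(matrix, rhs)]
    for col in range(n):
        pivot = next((r for r in range(col, n) if aug[r][col] != 0), None)
        if pivot is None:
            raise ValueError("flux system is underdetermined")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        aug[col] = [x / aug[col][col] for x in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [x - factor * y for x, y in zip(aug[r], aug[col])]
    return [row[-1] for row in aug]


def estimate_fluxes(stoich, measured):
    """Fill in unknown fluxes from S.v = 0 and the measured fluxes.

    stoich: rows are intracellular metabolites, columns are reactions.
    measured: {reaction index: measured flux}.
    """
    n_rxn = len(stoich[0])
    unknown = [j for j in range(n_rxn) if j not in measured]
    a = [[row[j] for j in unknown] for row in stoich]
    b = [-sum(row[j] * v for j, v in measured.items()) for row in stoich]
    fluxes = dict(measured)
    fluxes.update(zip(unknown, solve_square(a, b)))
    return [fluxes[j] for j in range(n_rxn)]


# Toy network: v1 uptake -> A; v2 A -> B; v3 A -> C; v4 B -> out; v5 C -> out
S = [
    [1, -1, -1, 0, 0],   # balance on metabolite A
    [0, 1, 0, -1, 0],    # balance on metabolite B
    [0, 0, 1, 0, -1],    # balance on metabolite C
]
fluxes = estimate_fluxes(S, {0: 10, 3: 6})  # uptake and one secretion measured
print([float(v) for v in fluxes])
```

With the substrate uptake and one secretion rate measured on-line, the split of flux between the two intracellular branches follows from the mass balances alone.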

Metabolic phenotype relies on the changes of the whole metabolic network within a cell. Metabolic phenotype relies on the change of pathway utilization with respect to environmental conditions, genetic regulation, developmental state and the genotype, etc. In one aspect of the methods of the invention, after the on-line MFA calculation, the dynamic behavior of the cells, their phenotype and other properties are analyzed by investigating the pathway utilization. For example, if the glucose supply is increased and the oxygen decreased during a yeast fermentation, the utilization of respiratory pathways will be reduced and/or stopped, and the utilization of the fermentative pathways will dominate. Control of the physiological state of cell cultures will become possible after the pathway analysis. The methods of the invention can help determine how to manipulate the fermentation by determining how to change the substrate supply, temperature, use of inducers, etc. to control the physiological state of cells to move in a desirable direction. In practicing the methods of the invention, the MFA results can also be compared with transcriptome and proteome data to design experiments and protocols for metabolic engineering or gene shuffling, etc.

In practicing the methods of the invention, any modified or new phenotype can be conferred and detected, including new or improved characteristics in the cell. Any aspect of metabolism or growth can be monitored.

Monitoring Expression of an mRNA Transcript

In one aspect of the invention, the engineered phenotype comprises increasing or decreasing the expression of an mRNA transcript or generating new transcripts in a cell. An mRNA transcript, or message, can be detected and quantified by any method known in the art, including, e.g., Northern blots, quantitative amplification reactions, hybridization to arrays, and the like. Quantitative amplification reactions include, e.g., quantitative PCR, including, e.g., quantitative reverse transcription polymerase chain reaction, or RT-PCR; quantitative real time RT-PCR, or “real-time kinetic RT-PCR” (see, e.g., Kreuzer (2001) Br. J. Haematol. 114: 313-318; Xia (2001) Transplantation 72: 907-914).

In one aspect of the invention, the engineered phenotype is generated by knocking out expression of a homologous gene. The gene's coding sequence or one or more transcriptional control elements, e.g., promoters or enhancers, can be knocked out. Thus, the expression of a transcript can be completely ablated or only decreased.

In one aspect of the invention, the engineered phenotype comprises increasing the expression of a homologous gene. This can be effected by knocking out a negative control element, including a transcriptional regulatory element acting in cis or in trans, or by mutagenizing a positive control element.

As discussed below in detail, one or more, or all, of the transcripts of a cell can be measured by hybridization of a sample comprising transcripts of the cell, or nucleic acids representative of or complementary to transcripts of a cell, to immobilized nucleic acids on an array.

Monitoring Expression of Polypeptides, Peptides and Amino Acids

In one aspect of the invention, the engineered phenotype comprises increasing or decreasing the expression of a polypeptide or generating new polypeptides in a cell. Polypeptides, peptides and amino acids can be detected and quantified by any method known in the art, including, e.g., nuclear magnetic resonance (NMR), spectrophotometry, radiography (protein radiolabeling), electrophoresis, capillary electrophoresis, high performance liquid chromatography (HPLC), thin layer chromatography (TLC), hyperdiffusion chromatography, various immunological methods, e.g. immunoprecipitation, immunodiffusion, immuno-electrophoresis, radioimmunoassays (RIAs), enzyme-linked immunosorbent assays (ELISAs), immuno-fluorescent assays, gel electrophoresis (e.g., SDS-PAGE), staining with antibodies, fluorescent activated cell sorter (FACS), pyrolysis mass spectrometry, Fourier-Transform Infrared Spectrometry, Raman spectrometry, GC-MS, and LC-Electrospray and cap-LC-tandem-electrospray mass spectrometries, and the like. Novel bioactivities can also be screened using methods, or variations thereof, described in U.S. Pat. No. 6,057,103. Furthermore, as discussed below in detail, one or more, or all, of the polypeptides of a cell can be measured using a protein array.

Biosynthetically directed fractional ¹³C labeling of proteinogenic amino acids can be monitored by feeding a mixture of uniformly ¹³C-labeled and unlabeled carbon source compounds into a bioreaction network. Analysis of the resulting labeling pattern enables both a comprehensive characterization of the network topology and the determination of metabolic flux ratios of the amino acids; see, e.g., Szyperski (1999) Metab. Eng. 1: 189-197.

Monitoring the Expression of Metabolites and Biosynthetic Pathways

In one aspect, primary and secondary metabolites are the measured metabolic parameters. Any relevant primary and secondary metabolite can be monitored in real time. For example, the measured metabolic parameter can comprise an increase or a decrease in a primary or a secondary metabolite. The secondary metabolite can be, e.g., a glycerol or a methanol. The measured metabolic parameter can comprise an increase or a decrease in an organic acid, such as an acetate, a butyrate, a succinate or an oxaloacetate.

The choice of which metabolite or metabolic or biosynthetic pathway to monitor “on-line” or in “real time” depends on which phenotype is desired to be added or modified. For example, limonene and other downstream metabolites of geranyl pyrophosphate can be monitored “on-line” or in “real time” to generate means for insect control in plants; see, e.g., U.S. Pat. No. 6,291,745. Metabolites/antibiotics in the supernatant of Bacillus subtilis cultures can be monitored for effective insecticidal, antifungal and antibacterial agents; see, e.g., U.S. Pat. No. 6,291,426. The methods of the invention can also be used to monitor metabolites of the tricarboxylic acid cycle and glycolysis, as was done in a Bacillus subtilis strain by Sauer (1997) Nat. Biotechnol. 15: 448-452 (who also used fractional ¹³C-labeling and two-dimensional nuclear magnetic resonance spectroscopy). The penicillin biosynthetic pathway can be monitored in real time in, e.g., Penicillium chrysogenum; see, e.g., Nielsen (1995) Biotechnol. Prog. 11(3): 299-305; Jorgensen (1995) Appl. Microbiol. Biotechnol. 43(1): 123-130. Asparagine linked (N-linked) glycosylation can be studied in real time; see, e.g., Nyberg (1999) Biotechnol. Bioeng. 62(3): 336-347. The amount of amino acids liberated from peptides in cell cultures grown in a hydrolysate-supplemented medium can be studied in real time; see, e.g., Nyberg (1999) Biotechnol. Bioeng. 62(3): 324-335, who studied pathway fluxes in Chinese hamster ovary cells grown in a complex (hydrolysate containing) medium. The methods of the invention can also be used to monitor flux distributions for maximal ATP production in mitochondria, including ATP yields for glucose, lactate, and palmitate; see, e.g., Ramakrishna (2001) Am. J. Physiol. Regul. Integr. Comp. Physiol. 280(3): R695-704.

In bacteria, the methods of the invention can also be used to monitor seven essential reactions in the central metabolic pathways (glycolysis, the pentose phosphate pathway, and the tricarboxylic acid cycle) for growth in a glucose medium, e.g., glucose minimal media. For gene modification, the seven genes encoding these enzymes can be grouped into three categories: (1) pentose phosphate pathway genes, (2) three-carbon glycolytic genes, and (3) tricarboxylic acid cycle genes. See, e.g., Edwards (2000) Biotechnol. Prog. 16(6): 927-939.

Monitoring Intracellular pH

In one aspect, an increase or a decrease in intracellular pH is measured “on-line” or in “real time.” The change in intracellular pH can be measured by intracellular application of a dye. The change in fluorescence of the dye can be measured over time.

Any system can be used to determine intracellular pH. If a dye is used, in one exemplary method, whole-field time-domain fluorescence lifetime imaging (FLIM) can be used. FLIM can be used for the quantitative imaging of concentration ratios of mixed fluorophores and quantitative imaging of perturbations to fluorophore environment; in FLIM, the image contrast is derived from the fluorescence lifetime at each point in a two-dimensional image (see, e.g., Cole (2001) J. Microsc. 203(Pt 3): 246-257). Near-field scanning optical microscopy (NSOM) is a high-resolution scanning probe technique that can be used to obtain simultaneous optical and topographic images with spatial resolution of tens of nanometers (see, e.g., Kwak (2001) Anal. Chem. 73(14): 3257-3262). A frequency domain fluorescence lifetime imaging microscope (FLIM) enables the measurement and reconstruction of three-dimensional nanosecond fluorescence lifetime images (see, e.g., Squire (1999) J. Microsc. 193(Pt 1): 36-49).

Monitoring Expression of Gases

In one aspect, the measured metabolic parameter comprises gas exchange rate measurements. Any gas can be monitored, e.g., oxygen, carbon monoxide, carbon dioxide, nitrogen and the like. See, e.g., Follstad (1999) Biotechnol. Bioeng. 63(6): 675-683.

Screening Methodologies and “On-Line” Monitoring Devices

In practicing the methods of the invention, “real time” or “on-line” cell monitoring devices are used to identify an engineered phenotype in the cell using real-time metabolic flux analysis. Any screening method can be used in conjunction with these “real time” or “on-line” cell monitoring devices.

Cell Growth Monitor Devices

In one aspect, real time monitoring of a plurality of metabolic parameters is done with use of a cell growth monitor device. One such exemplary device is the Cell Growth Monitor model 652 (Wedgewood Technology, Inc., San Carlos, Calif.), which can monitor a variety of metabolic parameters in “real time” or “on-line,” including: the uptake of substrates, such as glucose; the levels of intracellular intermediates, such as organic acids, e.g., acetate, butyrate, succinate, oxaloacetate; and levels of amino acids. Any cell growth monitor device can be used, and these devices can be modified to measure any set of parameters, without limitation. A cell growth monitor device can be used in conjunction with any other measuring or monitoring devices, such as those for rapid analysis of metabolites at the whole-cell level, using methods such as pyrolysis mass spectrometry, Fourier-Transform Infrared Spectrometry, Raman spectrometry, GC-MS, and LC-Electrospray and cap-LC-tandem-electrospray mass spectrometries.

Capillary Arrays

In addition to “biochip” arrays (see below), capillary arrays, such as the GIGAMATRIX™, Diversa Corporation, San Diego, Calif., can be used to screen for or monitor a variety of compositions, including polypeptides, nucleic acids, metabolites, by-products, antibiotics, metals, and the like, without limitation. Capillary arrays provide another system for holding and screening samples. For example, a sample screening apparatus can include a plurality of capillaries formed into an array of adjacent capillaries, wherein each capillary comprises at least one wall defining a lumen for retaining a sample. The apparatus can further include interstitial material disposed between adjacent capillaries in the array, and one or more reference indicia formed within the interstitial material. A capillary for screening a sample, wherein the capillary is adapted for being bound in an array of capillaries, can include a first wall defining a lumen for retaining the sample, and a second wall formed of a filtering material, for filtering excitation energy provided to the lumen to excite the sample.

A polypeptide or nucleic acid, e.g., a ligand, can be introduced in a first component into at least a portion of a capillary of a capillary array. Each capillary of the capillary array can comprise at least one wall defining a lumen for retaining the first component, and an air bubble can be introduced into the capillary behind the first component. A second component can be introduced into the capillary, wherein the second component is separated from the first component by the air bubble. A sample of interest can be introduced as a first liquid labeled with a detectable particle into a capillary of a capillary array, wherein each capillary of the capillary array comprises at least one wall defining a lumen for retaining the first liquid and the detectable particle, and wherein the at least one wall is coated with a binding material for binding the detectable particle to the at least one wall. The method can further include removing the first liquid from the capillary tube, wherein the bound detectable particle is maintained within the capillary, and introducing a second liquid into the capillary tube.

The capillary array can include a plurality of individual capillaries comprising at least one outer wall defining a lumen. The outer wall of the capillary can be one or more walls fused together. Similarly, the wall can define a lumen that is cylindrical, square, hexagonal or any other geometric shape so long as the walls form a lumen for retention of a liquid or sample. The capillaries of the capillary array can be held together in close proximity to form a planar structure. The capillaries can be bound together by being fused (e.g., where the capillaries are made of glass), glued, bonded, or clamped side-by-side. The capillary array can be formed of any number of individual capillaries, for example, a range from 100 to 4,000,000 capillaries. A capillary array can form a microtiter plate having about 100,000 or more individual capillaries bound together.

Arrays, or “BioChips”

In one aspect of the invention, the monitored parameter is transcript expression. One or more, or all, of the transcripts of a cell can be measured by hybridization of a sample comprising transcripts of the cell, or nucleic acids representative of or complementary to transcripts of a cell, to immobilized nucleic acids on an array, or “biochip.” By using an “array” of nucleic acids on a microchip, some or all of the transcripts of a cell can be simultaneously quantified. Arrays comprising genomic nucleic acid can also be used to determine the genotype of a newly engineered strain made by the methods of the invention. “Polypeptide arrays” can also be used to simultaneously quantify a plurality of proteins.

The present invention can be practiced with any known “array,” also referred to as a “microarray” or “nucleic acid array” or “polypeptide array” or “antibody array” or “biochip,” or variation thereof. Arrays are generically a plurality of “spots” or “target elements,” each target element comprising a defined amount of one or more biological molecules, e.g., oligonucleotides, immobilized onto a defined area of a substrate surface for specific binding to a sample molecule, e.g., mRNA transcripts.

In practicing the methods of the invention, known arrays and methods of making and using arrays can be incorporated in whole or in part, or variations thereof, as described, for example, in U.S. Pat. Nos. 6,277,628; 6,277,489; 6,261,776; 6,258,606; 6,054,270; 6,048,695; 6,045,996; 6,022,963; 6,013,440; 5,965,452; 5,959,098; 5,856,174; 5,830,645; 5,770,456; 5,632,957; 5,556,752; 5,143,854; 5,807,522; 5,800,992; 5,744,305; 5,700,637; 5,434,049; see also, e.g., WO 99/51773; WO 99/09217; WO 97/46313; WO 96/17958; see also, e.g., Johnston (1998) Curr. Biol. 8: R171-R174; Schummer (1997) Biotechniques 23: 1087-1092; Kern (1997) Biotechniques 23: 120-124; Solinas-Toldo (1997) Genes, Chromosomes & Cancer 20: 399-407; Bowtell (1999) Nature Genetics Supp. 21: 25-32. See also published U.S. patent applications Nos. 20010018642; 20010019827; 20010016322; 20010014449; 20010014448; 20010012537; 20010008765. The present invention can use any known array, e.g., GeneChips™, Affymetrix, Santa Clara, Calif.; SpectralChip™ Human BAC Arrays, Spectral Genomics, Houston, Tex.; and their accompanying manufacturer's instructions.

Antibodies and Immunoblots

In practicing the methods of the invention, antibodies can be used to isolate, identify or quantify particular polypeptides or polysaccharides. The antibodies can be used in immunoprecipitation, staining (e.g., FACS), immunoaffinity columns, and the like. If desired, nucleic acid sequences encoding specific antigens can be generated by immunization followed by isolation of polypeptide or nucleic acid, amplification or cloning and immobilization of polypeptide onto an array of the invention. Alternatively, the methods of the invention can be used to modify the structure of an antibody produced by a cell to be modified, e.g., an antibody's affinity can be increased or decreased. Furthermore, the ability to make or modify antibodies can be a phenotype engineered into a cell by the methods of the invention.

Methods of immunization, producing and isolating antibodies (polyclonal and monoclonal) are known to those of skill in the art and described in the scientific and patent literature, see, e.g., Coligan, CURRENT PROTOCOLS IN IMMUNOLOGY, Wiley/Greene, NY (1991); Stites (eds.) BASIC AND CLINICAL IMMUNOLOGY (7th ed.) Lange Medical Publications, Los Altos, Calif. (“Stites”); Goding, MONOCLONAL ANTIBODIES: PRINCIPLES AND PRACTICE (2d ed.) Academic Press, New York, N.Y. (1986); Kohler (1975) Nature 256: 495; Harlow (1988) ANTIBODIES, A LABORATORY MANUAL, Cold Spring Harbor Publications, New York. Antibodies also can be generated in vitro, e.g., using recombinant antibody binding site expressing phage display libraries, in addition to the traditional in vivo methods using animals. See, e.g., Hoogenboom (1997) Trends Biotechnol. 15: 62-70; Katz (1997) Annu. Rev. Biophys. Biomol. Struct. 26: 27-45.

Sources of Cells and Culturing of Cells

The invention provides a method for whole cell engineering of new phenotypes by using real-time metabolic flux analysis. Any cell can be engineered, including, e.g., bacterial, Archaebacteria, mammalian, yeast, fungi, insect or plant cell. In one aspect of the methods of the invention, a cell is modified by addition of a heterologous nucleic acid into the cell. The heterologous nucleic acid can be isolated, cloned or reproduced from a nucleic acid from any source, including any bacterial, mammalian, yeast, insect or plant cell.

In one aspect, the cell can be from a tissue or fluid taken from an individual, e.g., a patient. The cell can be homologous, e.g., a human cell taken from a patient, or heterologous, e.g., a bacterial or yeast cell taken from the gastrointestinal tract of an individual. The cell can be from, e.g., lymphatic or lymph node samples, serum, blood, cord blood, CSF or bone marrow aspirations, fecal samples, saliva, tears, tissue and surgical biopsies, needle or punch biopsies, and the like.

Any apparatus to grow or maintain cells can be used, e.g., a bioreactor or a fermentor, see, e.g., U.S. Pat. Nos. 6,242,248; 6,228,607; 6,218,182; 6,174,720; 6,168,949; 6,133,022; 6,133,021; 6,048,721; 5,660,977; 5,075,234.

Real-time Metabolic Flux Analysis

In the methods of the invention, at least one metabolic parameter of the cell is monitored in real time, i.e., by real time, or “on-line,” flux analysis. In alternative aspects, many parameters of the cells in culture are monitored simultaneously in real time. Because the real-time distribution of substrates, intermediates and products between alternative metabolic pathways is not accessible by the usual analytical means, the present invention incorporates an MFA method with “on-line” or “real-time” metabolome data. Therefore, by calculation, the metabolic flux distributions during the fermentation can be quantified. The flux quantification and gene expression analysis, along with sophisticated experimental techniques, can be combined to upgrade the content of information in the physiological and genomic/proteomic data towards the unraveling of cellular function and regulation. This allows insight into metabolic pathways, which is highly desirable and necessary in order to understand the behavior of the organism.

Metabolic Flux Analysis (MFA) is an analysis technique for metabolic engineering. It has been used in connection with studies of cell metabolism where the aim is to direct as much carbon as possible from the substrate into the biomass and products. Example 1, below, generally describes an exemplary Metabolic Flux Analysis (MFA) that can be used in the methods of the invention.

“Metabolomics” is a relatively unexplored field and can encompass the analysis of all cellular metabolites. Metabolomics provides a powerful new tool for gaining insight into functional biology, and has provided snapshots of the levels of numerous small molecules within a cell, and how those levels change under different conditions. These studies are very complementary to gene and polypeptide expression studies (genomics and proteomics), which are actively being applied to studies of infectious diseases, production, and model organisms, as well as human cells and plants. The present invention provides an improved methodology to study “metabolomics” by providing a method for whole cell engineering of new or modified phenotypes by using real-time metabolic flux analysis.

In practicing the methods of the invention, cellular control can be studied at different hierarchical levels: at the level of the genome, at the level of the transcriptome, at the level of the proteome or at the level of the metabolome. Whilst there is much current interest in the genome-wide analysis of cells at the level of transcription (to define the ‘transcriptome’) and translation (to define the ‘proteome’), the third level of analysis, that of the ‘metabolome’, has been curiously unexplored to date. The term ‘metabolome’ refers to the entire complement of all the small molecular weight metabolites inside a cell suspension (or other sample) of interest. It is likely that measurement of the metabolome in different physiological states, particularly using the methods of the invention, will in fact be much more discriminating for the purposes of functional genomics.

The genome (the total genetic material in the cell) specifies an organism's total repertoire of responses. The genomes of several organisms have now been completely sequenced and several others are near completion or well under way (including a number of parasites). Of the genes so far sequenced via the systematic genome sequencing programs, the functions of fewer than half are known with any confidence. Technological advances now allow gene expression at any particular stage of development or in any particular physiological state to be analyzed. Such analyses can be carried out at the level of transcription using either Northern blots or, more efficiently, hybridization array technologies to determine which genes are being expressed under different sets of conditions, i.e., the “transcriptome.” Similar analyses can be carried out at the level of translation to define the “proteome,” i.e., the total protein complement of the cell. Improvements in 2D electrophoresis and computer software for advanced image analysis allow 1-2×10³ proteins to be resolved on a single 20×20 cm plate; and mass spectrometry coupled with database searching provides a method for rapid protein identification. Changes in the transcriptome represent the initial response of a cell to change, while changes in the proteome represent the final response at the level of the macromolecule. The third level of analysis, and one analyzed by the methods of the invention, is that of the “metabolome,” which includes the quantitative complement of all the low molecular weight molecules present in cells in a particular physiological or developmental state.

Metabolite levels, which are monitored in alternative aspects of the invention, are thus the variables of choice to measure in a quantitative analysis of cellular function. Metabolites represent the downstream amplification of changes occurring in the transcriptome or the proteome. Moreover, metabolites regulate gene expression through a network of feedback pathways such that metabolites drive expression and act as the link between the genome and metabolism. The number of metabolites in the metabolome is also lower, by about an order of magnitude, than the number of gene products in the transcriptome or the proteome (a typical eukaryotic cell contains around 10⁵ genes and 10⁴ different expressed proteins but only about 10³ different known metabolites). Therefore, in order to understand intermediary metabolism and to exploit this knowledge, changes in the metabolome are much more relevant and will be much easier both to detect and to exploit than changes in either the transcriptome or the proteome.

The methods of the invention, by identifying sites of specific metabolic lesions via the metabolome, in addition to their inherent scientific interest, will lead to the detection of targets for potentially novel pharmaceuticals or agrochemicals in whole cells. The methods of the invention can also be used to design functional assays: their results can enable the design of much simpler assays in which only the targeted metabolites are studied, for specific high throughput, mechanistic assays.

The metabolome analysis of the invention has the advantage of being an online, non-invasive technology. Static metabolome analysis has some advantages over transcriptome and proteome analysis because, for many organisms, the number of metabolites is far fewer than the number of genes or proteins. However, static metabolome analysis has an intrinsic disadvantage as well: while biochemistry can generate information about the metabolic pathways, there is no direct link between the metabolites and the genes. There were also problems in analyzing the concentration or even the very presence of certain metabolites. Identification technologies such as infra-red spectrometry, mass spectrometry, or nuclear magnetic resonance spectroscopy produced some information, but their use was limited and they could not properly analyze a living cell. The methods of the invention, by providing “online” or “real-time” non-invasive technology, solve this problem. The “online” or “real-time” time dimension of the methods of the invention, lacking in older techniques, is one important factor in the methods' ability to analyze a living cell.

Metabolic flux analysis (MFA) is a powerful analysis tool that can couple observed extracellular phenomena, such as uptake/excretion rates, growth rate, product and biomass yields, etc., with the intracellular carbon flux and energy distribution. The “on-line” or “real-time” MFA of the invention can be used to investigate the physiology of Escherichia coli, Saccharomyces cerevisiae, and hybridomas (see, e.g., Keasling (1998) Biotechnol. Bioeng. 5; 58(2-3): 231-239; Pramanik (1998) Biotechnol. Bioeng. 60(2): 230-238; Nissen et al., 1997; Schulze et al., 1996; Follstad et al., 1999), lysine production and the effect of mutations in Corynebacterium glutamicum (see, e.g., Vallino (2000) Biotechnol. Bioeng. 67(6): 872-885; Vallino and Stephanopoulos, 1993, 1994; Park et al., 1997; Dominguez (1998) Eur. J. Biochem. 254(1): 96-102), riboflavin production in Bacillus subtilis (see, e.g., Sauer et al., 1996, 1998; Sauer (1997) Nat. Biotechnol. 15: 448-452), penicillin production in Penicillium chrysogenum (see, e.g., Nielsen (1995) Biotechnol. Prog. 11(3): 299-305; Jorgensen (1995) Appl. Microbiol. Biotechnol. 43(1): 123-130); and peptide amino acid metabolism in Chinese hamster ovary (CHO) cells (see, e.g., Nyberg (1999) Biotechnol. Bioeng. 62(3): 324-335; Nyberg (1999) Biotechnol. Bioeng. 62(3): 336-347).

Moreover, the “on-line” or “real-time” MFA of the invention can be used in combination with NMR, MS, and/or GC-MS to yield hard-to-get information about futile cycles, the degree of reaction reversibility, as well as active pathways; see, e.g., Szyperski (1999) Metab. Eng. 1: 189-197; Szyperski (1998) Q Rev. Biophys. 31: 41-106; Szyperski (1995) Eur. J. Biochem. 232(2): 433-448; Szyperski et al., 1997; Schmidt et al., 1998; Klapa (1999) Biotechnol. Bioeng. 62(4): 375-391; Mollney et al., 1999; Park et al., 1999; Wiechert et al., 1999; Wittmann and Heinzle, 1999. Schilling, Edwards, and Palsson have even extended the use of MFA to include the analysis of genomic data and the structural properties of cellular networks (Schilling (2000-2001) Biotechnol. Bioeng. 71(4): 286-306; Edwards and Palsson, 1998; Schilling et al., 1999a,b). MFA can also be used to monitor the C(3)-C(4) metabolite interconversion at the anaplerotic node in many microorganisms (see, e.g., Petersen (2000) J. Biol. Chem. 275(46): 35932-35941).

In MFA, the intracellular fluxes are calculated using a stoichiometric model for all the major intracellular reactions and by applying mass balances around the intracellular metabolites. As input to the calculations, a set of measured fluxes, typically the uptake rates of substrates and secretion rates of metabolites, is used.

The novel “real-time” or “on-line” metabolic flux analysis of the invention can provide data regarding a full suite of metabolites synthesized by a biological system under given environmental conditions and/or with genetic regulation. The “real-time” or “on-line” MFA methods of the invention can provide metabolomic data sets that are extremely complex. The MFA methods of the invention can be an adequate tool to handle, store, normalize, and evaluate the acquired data in order to describe the systemic response of a complex biological system. FIG. 1 is a schematic illustrating the invention's new application of MFA to determine new phenotypes, pathway utilizations and cell responses of the studied strains during actual cell culture or fermentation periods. The results can be used either for post-fermentation analysis or for immediate control of the metabolism.

The “on-line,” or “real-time” methods of the invention can also incorporate other analytical devices, such as HPLC and GC/MS, to estimate flux distribution in metabolic networks (constructed with our biochemical knowledge and genomic/proteomic information database) from experimental measurements. With these devices, “snapshots” of the biological systems under study can be obtained periodically, e.g., about every 1, 5, 10, 15, 20, 25, or 30 minutes, depending on the number of metabolic parameters studied and number of devices used.

Vector r for Metabolome Data

The on-line MFA of the invention uses “rate of change” data, i.e., the difference between the current metabolic measurements and the previous measurements. The differences are calculated and stored in the “raw measurement” vector for error analysis before they are used. Thus, in one aspect, a “preprocessing unit” is used to filter errors out of the measurements before the metabolic flux analysis, to ensure that quality data are used. See Example 1, below.
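The construction of the vector r described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `rate_vector` and `preprocess`, the division by the sampling interval, and the median-based error filter are hypothetical choices made here, since the text does not specify how the “preprocessing unit” filters errors.

```python
from statistics import median

def rate_vector(previous, current, dt_hours):
    """Rate of change per metabolite: (current - previous) / elapsed time.

    Dividing by the sampling interval is an assumption of this sketch.
    """
    if dt_hours <= 0:
        raise ValueError("dt_hours must be positive")
    return {name: (current[name] - previous[name]) / dt_hours for name in current}

def preprocess(raw_rates, history, tolerance=3.0):
    """Hypothetical error filter (not taken from the source text).

    A raw rate that deviates from the median of recent rates by more than
    `tolerance` times their median absolute deviation is replaced by that median.
    """
    cleaned = {}
    for name, value in raw_rates.items():
        past = history.get(name, [])
        if len(past) >= 3:
            centre = median(past)
            spread = median(abs(p - centre) for p in past) or 1e-12
            if abs(value - centre) > tolerance * spread:
                value = centre
        cleaned[name] = value
    return cleaned

# Illustrative concentrations at two successive samples, 0.5 h apart.
previous = {"GLC": 10.0, "LAC": 2.0}
current = {"GLC": 9.0, "LAC": 2.5}
raw = rate_vector(previous, current, dt_hours=0.5)

# Illustrative recent rates used as the reference for the filter.
history = {"GLC": [-2.1, -1.9, -2.0], "LAC": [0.9, 1.0, 1.1]}
r = preprocess(raw, history)
```

Here the raw rates agree with the recent history and pass through unchanged; a grossly deviating value would be replaced by the historical median before being passed to the flux calculation.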

Computer Systems

In one aspect, the methods of the invention use computer-implemented methods/programs to monitor in real time the change in measured metabolic parameters over time. The methods of the invention can be practiced using any program language or computer/processor and in conjunction with any known software or methodology. For example, the program MATHEMATICA™ (Wolfram Research, Inc., Champaign, Ill.), such as MATHEMATICA 4.1™, or variations thereof, can be used; see Example 1, below; and see also, e.g., Jamshidi (2001) Bioinformatics 17(3): 286-287; Wilson (2001) Biophys. Chem. 91(3): 281-304; Torrecilla (2001) J. Neurochem. 76(5): 1291-1307.

The computer/processor used to practice the methods of the invention can be a conventional general-purpose digital computer, e.g., a personal “workstation” computer, including conventional elements such as a microprocessor and a data transfer bus. The computer/processor can further include any form of memory elements, such as dynamic random access memory, flash memory or the like, or mass storage such as magnetic disc or optical storage.

For example, a conventional personal computer such as those based on an Intel microprocessor and running a Windows operating system can be used. Any hardware or software configuration can be used to practice the methods of the invention. For example, computers based on other well-known microprocessors and running operating system software such as UNIX, Linux, MacOS and others are contemplated.

EXAMPLES

The following examples are offered to illustrate, but not to limit, the claimed invention.

Example 1 Metabolic Flux Analysis (MFA)

The following example describes implementation of an exemplary Metabolic Flux Analysis (MFA), which is applied in the real time analysis of cell cultures in the methods of the invention. See FIG. 1.

Metabolic Flux Analysis (MFA) is an important analysis technique of metabolic engineering. A flux balance can be written for each metabolite (yi) within a metabolic system to yield the dynamic mass balance equations that interconnect the various metabolites. Generally, for a metabolic network that contains m compounds and n metabolic fluxes, all the transient material balances can be represented by a single matrix equation:

dY/dt = AX(t) − r(t)

where
-   Y: m-dimensional vector of metabolite amounts per cell
-   X: vector of n metabolic fluxes
-   A: stoichiometric m×n matrix, and
-   r: vector of specific rates from measurements

The time constants characterizing metabolic transients are typically very short compared to the time constants of cell growth and process dynamics; therefore, the mass balances can be simplified to consider only the steady-state behavior. Eliminating the derivative yields:

AX(t) = r(t).

Provided that m≥n and A is full rank, the least squares solution of the above equation (with all measurements weighted equally) is: X=(A^(T)A)⁻¹A^(T)r.

The sensitivity of the solution can be investigated by the matrix:

dX/dr = (A^(T)A)⁻¹A^(T).

The elements of the above matrix are useful for the determination of the change of individual fluxes with respect to the error or perturbation in the measurements.
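For reference, the least squares expression and the sensitivity matrix can be recovered from the steady-state balance AX = r by minimizing the squared residual. The following is a short sketch using the same symbols A, X and r, assuming equal weights on all measurements and that A has full column rank so that A^(T)A is invertible:

```latex
\begin{aligned}
J(X) &= \lVert AX - r \rVert^{2} = (AX - r)^{T}(AX - r),\\
\frac{\partial J}{\partial X} &= 2A^{T}(AX - r) = 0
\;\Longrightarrow\; A^{T}A\,X = A^{T}r,\\
X &= (A^{T}A)^{-1}A^{T}r,\\
\delta X &= (A^{T}A)^{-1}A^{T}\,\delta r
\quad\text{so that}\quad \frac{dX}{dr} = (A^{T}A)^{-1}A^{T}.
\end{aligned}
```

Because X depends linearly on r, a perturbation δr in the measurements propagates to the fluxes through the same matrix, which is why its elements can be read as sensitivities.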

Inputs

Stoichiometric Equations

A stoichiometry matrix is derived from the chemical equations to be used in the analysis. The matrix consists of coefficients of chemical species involved in the reactions. Rows represent the species and columns represent the equations. For instance, if we consider the equations of energy production in cells:

2NADH+O2+6ADP→2NAD+2H2O+6ATP
2FADH+O2+4ADP→2FAD+2H2O+4ATP
ATP→ADP

This system yields a stoichiometry matrix with 3 columns and as many rows as species to be considered in the overall system. In this case, 8 species are considered, so the matrix is 8×3 (8 rows by 3 columns):

Species   Eq. 1   Eq. 2   Eq. 3
NADH       −2       0       0
O2         −1      −1       0
NAD         2       0       0
H2O         2       2       0
FADH        0      −2       0
FAD         0       2       0
ATP         6       4      −1
ADP        −6      −4       1
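The construction of the matrix from the three energy production equations can be sketched in Python as follows. This is an illustration only, not the spreadsheet template referred to in this example; the function names `parse_side` and `stoichiometry_matrix`, and the use of "->" in place of the reaction arrow, are assumptions of the sketch. Reactants receive negative coefficients and products positive ones, matching the sign convention of the matrix above.

```python
import re

# One term: optional numeric coefficient followed by a species name (e.g. "2H2O").
TERM = re.compile(r"\s*(\d+(?:\.\d+)?)?\s*([A-Za-z]\w*)\s*")

def parse_side(side):
    """Parse 'aX+bY' into {species: coefficient}; a missing coefficient means 1."""
    terms = {}
    for term in side.split("+"):
        match = TERM.fullmatch(term)
        if match is None:
            raise ValueError(f"cannot parse term: {term!r}")
        coefficient = float(match.group(1)) if match.group(1) else 1.0
        terms[match.group(2)] = terms.get(match.group(2), 0.0) + coefficient
    return terms

def stoichiometry_matrix(reactions, species=None):
    """Return (species, matrix) with one row per species, one column per reaction."""
    columns = []
    for reaction in reactions:
        left, right = reaction.split("->")
        column = {}
        for name, c in parse_side(left).items():
            column[name] = column.get(name, 0.0) - c   # consumed
        for name, c in parse_side(right).items():
            column[name] = column.get(name, 0.0) + c   # produced
        columns.append(column)
    if species is None:
        # Default row order: first appearance in the reaction list.
        species = []
        for column in columns:
            for name in column:
                if name not in species:
                    species.append(name)
    matrix = [[column.get(name, 0.0) for column in columns] for name in species]
    return species, matrix

reactions = [
    "2NADH+O2+6ADP->2NAD+2H2O+6ATP",
    "2FADH+O2+4ADP->2FAD+2H2O+4ATP",
    "ATP->ADP",
]
order = ["NADH", "O2", "NAD", "H2O", "FADH", "FAD", "ATP", "ADP"]
species, A = stoichiometry_matrix(reactions, order)
for name, row in zip(species, A):
    print(f"{name:5s}", row)
```

Passing an explicit species order keeps the rows aligned with the rate vector r, which must list the species in the same order.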

Using these templates, the stoichiometric matrix is 35×33, and it is in the EXCEL 97™ file “stoichiex.xls”. This is the matrix ‘A’ described above, and it is derived from the 33 chemical equations below.

1. Central Metabolic Pathways
GLC+ATP+NAD→2PYR+ADP+NADH+H2O  1)
PYR+NADH→LAC+NAD  2)
PYR+NAD→ACCOA+CO2+NADH  3)
ACCOA+OAA+NAD+H2O→AKG+CO2+NADH  4)
AKG+NAD→SUCCOA+CO2+NADH  5)
SUCCOA+ADP+H2O+FAD→FUM+ATP+FADH  6)
FUM+H2O→MAL  7)
MAL+NAD→OAA+NADH  8)
GLN+ADP→GLU+NH3+ATP  9)
GLU+NAD→AKG+NH3+NADH  10)
MAL→PYR+CO2  11)

2. Biomass Synthesis: C 50.5%, H 8.31%, O 32.93%, N 8.26%
0.1016 GLC+0.031 GLN+0.008 ARG+0.0003 ASN+0.001 GLU+0.0038 GLY+0.0028 HIS+0.0071 ILE+0.008 LEU+0.0043 LYS+0.001 MET+0.0152 THR+0.0051 VAL→BIOMASS  12)

3. Amino Acid Metabolism
PYR+GLU→ALA+AKG  13)
SER→PYR+NH3  14)
GLY→SER  15)
CYS→PYR+NH3  16)
ASP+AKG→OAA+GLU  17)
ASN→ASP+NH3  18)
HIS→GLU+NH3  19)
ARG+AKG→2 GLU  20)
PRO→GLU  21)
ILE+AKG→SUCCOA+ACCOA+GLU  22)
VAL+AKG→GLU+CO2+SUCCOA  23)
MET→SUCCOA  24)
THR→SUCCOA+NH3  25)
PHE→TYR  26)
TYR+AKG→GLU+FUM+2 ACCOA  27)
LYS+2AKG→2GLU+2CO2+2ACCOA  28)
LEU+AKG→GLU+3 ACCOA  29)

4. Antibody Formation:
1.05 ARG+1.98 ASN+1.96 ASP+1.42 GLU+1.31 GLY+1.59 ILE+3.79 LEU+1.97 LYS+0.67 MET+0.95 PHE+5.72 SER+1.32 THR+5.05 TYR+2.68 VAL→Ab  30)

5. Energy Production:
2NADH+O2+6ADP→2NAD+2H2O+6ATP  31)
2FADH+O2+4ADP→2FAD+2H2O+4ATP  32)
ATP→ADP  33)

In order to use this matrix with other mathematics software, it must be converted to a text file. Highlight only the cells that contain numbers, select Copy from the Edit menu, and paste into a document in a simple text editor, e.g., the “Notepad” text editor program that comes with Microsoft Windows™ 3.11, 95 and NT. The file can then be saved from the editor as a text file “*.txt”.

Specific Uptake Rates

The specific uptake rates are calculated from data from a cell culture reactor. These data should also be in a text file as a vector of rates, r, that correspond to the appropriate chemical species, i.e., the rows in the stoichiometry matrix above. In the provided templates, the specific rates are listed in the EXCEL 97™ file “ratex.xls” as well as a text file (exported from Excel) “rate.txt”.

Calculations

With the inputs in the desired form, a mathematics software package is used to calculate the estimated internal fluxes. This software should be able to handle matrix math and differential equations. One template was made in MATHEMATICA™ 3.0 and is named “mfamath.nb”. The following section assumes that the calculations are done in MATHEMATICA™ 3.0, but the general procedure can be applied with any suitable package.

Read in Data

First the default directory is set using the SetDirectory command:

-   -   example: SetDirectory["a:\\mfa\\"]

The data are then read in and saved into the A matrix (for the stoichiometry matrix) and the r vector (for the specific rates).

-   -   example: A=ReadList["stoichi.txt", Number, RecordLists->True]
        -   r=ReadList["rate.txt", Number, RecordLists->True]

Sensitivity Analysis

Next, the sensitivity matrix (dX/dr) is calculated as (A^(T)A)⁻¹A^(T).

-   -   example: sens=Inverse [Transpose [A]. A]. Transpose [A]

Solution and Error Analysis

The least squares estimation of the flux distributions, x, and the errors, e, are calculated for the over-determined system of equations.

-   -   example: x=sens.r
        -   e=r-A.x
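For readers without MATHEMATICA™, the same three quantities (the sensitivity matrix, the flux estimate x and the error vector e) can be sketched in plain Python. This is a minimal illustration, not the “mfamath.nb” template: the helper names are hypothetical, r is held as a flat list rather than a column of records, and the 3×2 matrix and rate vector are made-up numbers chosen only to give a small over-determined system.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def inverse(m):
    """Gauss-Jordan inversion with partial pivoting for a small square matrix."""
    n = len(m)
    aug = [[float(v) for v in row] + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda k: abs(aug[k][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular; A must have full column rank")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for k in range(n):
            if k != col:
                f = aug[k][col]
                aug[k] = [v - f * w for v, w in zip(aug[k], aug[col])]
    return [row[n:] for row in aug]

def flux_estimate(A, r):
    """Return (sens, x, e) with sens = (A^T A)^-1 A^T, x = sens.r, e = r - A.x."""
    At = transpose(A)
    sens = matmul(inverse(matmul(At, A)), At)
    x = [row[0] for row in matmul(sens, [[v] for v in r])]
    fitted = [row[0] for row in matmul(A, [[v] for v in x])]
    e = [ri - fi for ri, fi in zip(r, fitted)]
    return sens, x, e

# Made-up over-determined system: 3 measured rates, 2 unknown fluxes.
A = [[1, 0],
     [0, 1],
     [1, 1]]
r = [1.0, 2.0, 3.1]
sens, x, e = flux_estimate(A, r)
print("x =", x)
print("e =", e)
```

A useful check on any such calculation is that the residual is orthogonal to the columns of A, i.e., A^(T)e is zero to within rounding error.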

Output of Results

After calculation of the flux estimations, the results must be written to text files for presentation. In the templates provided, 3 results text files are included. These files are “flux.txt” that contains the x vector, “error.txt” that holds the error vector, and “sensitivity.txt” that contains the sensitivity matrix. An example of creating these text files in MATHEMATICA™ is shown below.

-   -   example: a1=OpenWrite["flux.txt", FormatType->OutputForm];
        -   Write[a1, TableForm[x, TableSpacing->{0,1}]]; Close[a1]

Presentation of Results

A critical aspect of this analysis is the efficient and clear presentation of the large number of estimated fluxes. The output text files from MATHEMATICA™ can be imported into Excel, and the solution can be plotted as a collection of bar graphs.

The EXCEL 97™ file “mfaexc.xls” is the template provided that shows the table of data and the bar graphs for each flux. It also contains a composite bar graph that plots the fluxes together and grouped by metabolic pathway (see below).

An additional way to present the data is to show all the internal fluxes overlain on a map of the relevant metabolic pathways. The POWERPOINT™ template file “mfa.ppt” shows a metabolic map with bar graphs (linked to the Excel file “mfaexc.xls”, which must be opened before the file “mfa.ppt”) to show the magnitude of the fluxes. The POWERPOINT™ presentation is linked to the Excel file; when the data in Excel are updated, the links in the presentation should be updated.

Devices to Monitor Organic Acids and Amino Acids

On-line devices that can monitor organic acids and amino acids can also be used in practicing the methods of the invention. For example, in one aspect, the BIO+ON-LINE™ (Lachat Instruments, Milwaukee, Wis.) provides near-real-time monitoring of fermentation and mammalian cell culture processes. This device can provide critical information to maximize product yields. Mounted on a cart, this device can be rolled up to a fermentation bank and connected via a stream selector valve. From there, chemical constituent monitoring occurs automatically for ammonia, glucose, glutamate, glutamine, glycerol, lactate and phosphate individually, and for organic acids as a profile employing ion exclusion chromatography. The BIO+ON-LINE™ is an integrated sampling system that uses a pumping system combined with a FLOWNAMICS® filter probe, which exhibits the following benefits: sterilizable in-place; risk-free sampling due to elimination of bypass filters which recirculate material back into the vessel; sterile, cell-free sampling; accommodates all vessel sizes; minimum dead volume to ensure consistent and accurate sampling and to reduce flush time; durable design and construction to withstand temperatures, pressures, viscosities, shear forces and chemical constituents typical of bioprocess environments.

The BIO+ON-LINE™ can determine up to four analytes simultaneously using flow injection analysis. The reaction modules can be removed and substituted with other modules. Thus, the user can customize the unit for different fermentation/bioprocess requirements. Additionally, the Ion Chromatography channel can be customized to meet other Liquid Chromatography (LC) needs. While conductivity detection is the default detector, users can connect UV, RI, or other detectors and their own columns to the unit to meet their customized LC separation needs. This system, or variations thereof, is applicable to aerobic and anaerobic bacterial cultures as well as yeast, fungi, algae, insect and mammalian cell cultures.

Other related devices that can be used to practice the invention include the QUIKCHEM® 8000 (Lachat Instruments, Milwaukee, Wis.) which allows high sample throughput coupled with simple and rapid method changeover to maximize productivity in determining ionic species in a diversity of sample matrices from sub-ppb to percent concentrations.

One skilled in the art will readily appreciate that the present invention is well adapted to carry out the objects and obtain the ends and advantages mentioned as well as those inherent therein. The methods described herein are presently representative of exemplary aspects and are not intended as limitations on the scope of the invention. Changes therein and other uses will occur to those skilled in the art which are encompassed within the spirit of the invention and are defined by the scope of the claims.

3. MODIFYING: DIRECTED EVOLUTION METHODS

In one aspect the invention described herein is directed to the use of repeated cycles of reductive reassortment, recombination and selection which allow for the directed molecular evolution of highly complex linear sequences, such as DNA, RNA or proteins, through recombination.

In vivo shuffling of molecules can be performed utilizing the natural property of cells to recombine multimers. While recombination in vivo has provided the major natural route to molecular diversity, genetic recombination remains a relatively complex process that involves 1) the recognition of homologies; 2) strand cleavage, strand invasion, and metabolic steps leading to the production of recombinant chiasma; and finally 3) the resolution of chiasma into discrete recombined molecules. The formation of the chiasma requires the recognition of homologous sequences.

In a preferred embodiment, the invention relates to a method for producing a hybrid polynucleotide from at least a first polynucleotide and a second polynucleotide. The present invention can be used to produce a hybrid polynucleotide by introducing at least a first polynucleotide and a second polynucleotide which share at least one region of partial sequence homology into a suitable host cell. The regions of partial sequence homology promote processes which result in sequence reorganization producing a hybrid polynucleotide. The term “hybrid polynucleotide”, as used herein, is any nucleotide sequence which results from the method of the present invention and contains sequence from at least two original polynucleotide sequences. Such hybrid polynucleotides can result from intermolecular recombination events which promote sequence integration between DNA molecules. In addition, such hybrid polynucleotides can result from intramolecular reductive reassortment processes which utilize repeated sequences to alter a nucleotide sequence within a DNA molecule.

The invention provides a means for generating hybrid polynucleotides which may encode biologically active hybrid polypeptides. In one aspect, the original polynucleotides encode biologically active polypeptides. The method of the invention produces new hybrid polypeptides by utilizing cellular processes which integrate the sequence of the original polynucleotides such that the resulting hybrid polynucleotide encodes a polypeptide demonstrating activities derived from the original biologically active polypeptides. For example, the original polynucleotides may encode a particular enzyme from different microorganisms. An enzyme encoded by a first polynucleotide from one organism may, for example, function effectively under a particular environmental condition, e.g. high salinity. An enzyme encoded by a second polynucleotide from a different organism may function effectively under a different environmental condition, such as extremely high temperatures. A hybrid polynucleotide containing sequences from the first and second original polynucleotides may encode an enzyme which exhibits characteristics of both enzymes encoded by the original polynucleotides. Thus, the enzyme encoded by the hybrid polynucleotide may function effectively under environmental conditions shared by each of the enzymes encoded by the first and second polynucleotides, e.g., high salinity and extreme temperatures.

Enzymes encoded by the original polynucleotides of the invention include, but are not limited to: oxidoreductases, transferases, hydrolases, lyases, isomerases and ligases. A hybrid polypeptide resulting from the method of the invention may exhibit specialized enzyme activity not displayed in the original enzymes. For example, following recombination and/or reductive reassortment of polynucleotides encoding hydrolase activities, the resulting hybrid polypeptide encoded by a hybrid polynucleotide can be screened for specialized hydrolase activities obtained from each of the original enzymes, i.e. the type of bond on which the hydrolase acts and the temperature at which the hydrolase functions. Thus, for example, the hydrolase may be screened to ascertain those chemical functionalities which distinguish the hybrid hydrolase from the original hydrolases, such as: (a) amide (peptide bonds), i.e. proteases; (b) ester bonds, i.e. esterases and lipases; (c) acetals, i.e. glycosidases; and, for example, the temperature, pH or salt concentration at which the hybrid polypeptide functions.

Sources of the original polynucleotides may be isolated from individual organisms (“isolates”), collections of organisms that have been grown in defined media (“enrichment cultures”), or, most preferably, uncultivated organisms (“environmental samples”). The use of a culture-independent approach to derive polynucleotides encoding novel bioactivities from environmental samples is most preferable since it allows one to access untapped resources of biodiversity.

“Environmental libraries” are generated from environmental samples and represent the collective genomes of naturally occurring organisms archived in cloning vectors that can be propagated in suitable prokaryotic hosts. Because the cloned DNA is initially extracted directly from environmental samples, the libraries are not limited to the small fraction of prokaryotes that can be grown in pure culture. Additionally, a normalization of the environmental DNA present in these samples could allow more equal representation of the DNA from all of the species present in the original sample. This can dramatically increase the efficiency of finding interesting genes from minor constituents of the sample which may be under-represented by several orders of magnitude compared to the dominant species.
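The effect of normalization on screening effort can be illustrated with the standard library-coverage estimate N = ln(1 - P) / ln(1 - f), where f is the fraction of clones derived from the species of interest and P is the desired probability of sampling at least one such clone. The following Python sketch is illustrative only: the species names, the 1000-fold abundance difference, and the idealized equal representation after normalization are assumptions, not values taken from this specification.

```python
import math

def clones_needed(target_fraction, probability=0.99):
    """Clones to screen so that at least one comes from the target species.

    Uses N = ln(1 - P) / ln(1 - f) and assumes clones are sampled independently.
    """
    if not 0.0 < target_fraction < 1.0:
        raise ValueError("target_fraction must be strictly between 0 and 1")
    return math.ceil(math.log(1.0 - probability) / math.log(1.0 - target_fraction))

# Hypothetical sample: a dominant species and a minor one that is 1000-fold rarer.
raw_abundance = {"dominant": 1000.0, "minor": 1.0}
total = sum(raw_abundance.values())
raw_fraction = {name: amount / total for name, amount in raw_abundance.items()}

# Idealized normalization (an assumption): every species is equally represented.
normalized_fraction = {name: 1.0 / len(raw_abundance) for name in raw_abundance}

clones_raw = clones_needed(raw_fraction["minor"])
clones_normalized = clones_needed(normalized_fraction["minor"])
print(clones_raw, clones_normalized)
```

Under these assumptions the number of clones that must be screened to reach the minor species drops by several orders of magnitude after normalization.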

For example, gene libraries generated from one or more uncultivated microorganisms are screened for an activity of interest. Potential pathways encoding bioactive molecules of interest are first captured in prokaryotic cells in the form of gene expression libraries. Polynucleotides encoding activities of interest are isolated from such libraries and introduced into a host cell. The host cell is grown under conditions which promote recombination and/or reductive reassortment creating potentially active biomolecules with novel or enhanced activities.

The microorganisms from which the polynucleotide may be prepared include prokaryotic microorganisms, such as Eubacteria and Archaebacteria, and lower eukaryotic microorganisms such as fungi, some algae and protozoa. Polynucleotides may be isolated from environmental samples in which case the nucleic acid may be recovered without culturing of an organism or recovered from one or more cultured organisms. In one aspect, such microorganisms may be extremophiles, such as hyperthermophiles, psychrophiles, psychrotrophs, halophiles, barophiles and acidophiles. Polynucleotides encoding enzymes isolated from extremophilic microorganisms are particularly preferred. Such enzymes may function at temperatures above 100° C. in terrestrial hot springs and deep sea thermal vents, at temperatures below 0° C. in arctic waters, in the saturated salt environment of the Dead Sea, at pH values around 0 in coal deposits and geothermal sulfur-rich springs, or at pH values greater than 11 in sewage sludge. For example, several esterases and lipases cloned and expressed from extremophilic organisms show high activity throughout a wide range of temperatures and pHs.

Polynucleotides selected and isolated as hereinabove described are introduced into a suitable host cell. A suitable host cell is any cell which is capable of promoting recombination and/or reductive reassortment. The selected polynucleotides are preferably already in a vector which includes appropriate control sequences. The host cell can be a higher eukaryotic cell, such as a mammalian cell, or a lower eukaryotic cell, such as a yeast cell, or preferably, the host cell can be a prokaryotic cell, such as a bacterial cell. Introduction of the construct into the host cell can be effected by calcium phosphate transfection, DEAE-Dextran mediated transfection, or electroporation (Davis et al, 1986).

As representative examples of appropriate hosts, there may be mentioned: bacterial cells, such as E. coli, Streptomyces, Salmonella typhimurium; fungal cells, such as yeast; insect cells such as Drosophila S2 and Spodoptera Sf9; animal cells such as CHO, COS or Bowes melanoma; adenoviruses; and plant cells. The selection of an appropriate host is deemed to be within the scope of those skilled in the art from the teachings herein.

With particular reference to various mammalian cell culture systems that can be employed to express recombinant protein, examples of mammalian expression systems include the COS-7 lines of monkey kidney fibroblasts, described in “SV40-transformed simian cells support the replication of early SV40 mutants” (Gluzman, 1981), and other cell lines capable of expressing a compatible vector, for example, the C127, 3T3, CHO, HeLa and BHK cell lines. Mammalian expression vectors will comprise an origin of replication, a suitable promoter and enhancer, and also any necessary ribosome binding sites, polyadenylation site, splice donor and acceptor sites, transcriptional termination sequences, and 5′ flanking nontranscribed sequences. DNA sequences derived from the SV40 splice and polyadenylation sites may be used to provide the required nontranscribed genetic elements.

Host cells containing the polynucleotides of interest can be cultured in conventional nutrient media modified as appropriate for activating promoters, selecting transformants or amplifying genes. The culture conditions, such as temperature, pH and the like, are those previously used with the host cell selected for expression, and will be apparent to the ordinarily skilled artisan. The clones which are identified as having the specified enzyme activity may then be sequenced to identify the polynucleotide sequence encoding an enzyme having the enhanced activity.

In another aspect, it is envisioned that the method of the present invention can be used to generate novel polynucleotides encoding biochemical pathways from one or more operons or gene clusters or portions thereof. For example, bacteria and many eukaryotes have a coordinated mechanism for regulating genes whose products are involved in related processes. The genes are clustered, in structures referred to as “gene clusters,” on a single chromosome and are transcribed together under the control of a single regulatory sequence, including a single promoter which initiates transcription of the entire cluster. Thus, a gene cluster is a group of adjacent genes that are either identical or related, usually as to their function. An example of a biochemical pathway encoded by gene clusters is polyketide biosynthesis. Polyketides are molecules which are an extremely rich source of bioactivities, including antibiotics (such as tetracyclines and erythromycin), anti-cancer agents (daunomycin), immunosuppressants (FK506 and rapamycin), and veterinary products (monensin). Many polyketides (produced by polyketide synthases) are valuable as therapeutic agents. Polyketide synthases are multifunctional enzymes that catalyze the biosynthesis of an enormous variety of carbon chains differing in length and patterns of functionality and cyclization. Polyketide synthase genes fall into gene clusters and at least one type (designated type I) of polyketide synthase has large size genes and enzymes, complicating genetic manipulation and in vitro studies of these genes/proteins.

The ability to select and combine desired components from a library of polyketides, or fragments thereof, and postpolyketide biosynthesis genes for generation of novel polyketides for study is appealing. The method of the present invention makes it possible to facilitate the production of novel polyketide synthases through intermolecular recombination.

Preferably, gene cluster DNA can be isolated from different organisms and ligated into vectors, particularly vectors containing expression regulatory sequences which can control and regulate the production of a detectable protein or protein-related array activity from the ligated gene clusters. Use of vectors which have an exceptionally large capacity for exogenous DNA introduction are particularly appropriate for use with such gene clusters and are described by way of example herein to include the f-factor (or fertility factor) of E. coli. This f-factor of E. coli is a plasmid which effects high-frequency transfer of itself during conjugation and is ideal to achieve and stably propagate large DNA fragments, such as gene clusters from mixed microbial samples. Once ligated into an appropriate vector, two or more vectors containing different polyketide synthase gene clusters can be introduced into a suitable host cell. Regions of partial sequence homology shared by the gene clusters will promote processes which result in sequence reorganization resulting in a hybrid gene cluster. The novel hybrid gene cluster can then be screened for enhanced activities not found in the original gene clusters.

Therefore, in a preferred embodiment, the present invention relates to a method for producing a biologically active hybrid polypeptide and screening such a polypeptide for enhanced activity by:

-   1) introducing at least a first polynucleotide in operable linkage and a second polynucleotide in operable linkage, said at least first polynucleotide and second polynucleotide sharing at least one region of partial sequence homology, into a suitable host cell;
-   2) growing the host cell under conditions which promote sequence reorganization resulting in a hybrid polynucleotide in operable linkage;
-   3) expressing a hybrid polypeptide encoded by the hybrid polynucleotide;
-   4) screening the hybrid polypeptide under conditions which promote identification of enhanced biological activity; and
-   5) isolating a polynucleotide encoding the hybrid polypeptide.

Methods for screening for various enzyme activities are known to those of skill in the art and discussed throughout the present specification. Such methods may be employed when isolating the polypeptides and polynucleotides of the present invention.

As representative examples of expression vectors which may be used there may be mentioned viral particles, baculovirus, phage, plasmids, phagemids, cosmids, fosmids, bacterial artificial chromosomes, viral DNA (e.g. vaccinia, adenovirus, fowl pox virus, pseudorabies and derivatives of SV40), P1-based artificial chromosomes, yeast plasmids, yeast artificial chromosomes, and any other vectors specific for specific hosts of interest (such as bacillus, aspergillus and yeast). Thus, for example, the DNA may be included in any one of a variety of expression vectors for expressing a polypeptide. Such vectors include chromosomal, nonchromosomal and synthetic DNA sequences. Large numbers of suitable vectors are known to those of skill in the art, and are commercially available. The following vectors are provided by way of example. Bacterial: pQE vectors (Qiagen), pBluescript plasmids, pNH vectors, lambda-ZAP vectors (Stratagene); ptrc99a, pKK223-3, pDR540, pRIT2T (Pharmacia). Eukaryotic: pXT1, pSG5 (Stratagene), pSVK3, pBPV, pMSG, pSVLSV40 (Pharmacia). However, any other plasmid or other vector may be used as long as it is replicable and viable in the host. Low copy number or high copy number vectors may be employed with the present invention.

A preferred type of vector for use in the present invention contains an f-factor origin of replication. The f-factor (or fertility factor) in E. coli is a plasmid which effects high frequency transfer of itself during conjugation and less frequent transfer of the bacterial chromosome itself. A particularly preferred embodiment is to use cloning vectors, referred to as “fosmids” or bacterial artificial chromosome (BAC) vectors. These are derived from E. coli f-factor which is able to stably integrate large segments of genomic DNA. When integrated with DNA from a mixed uncultured environmental sample, this makes it possible to achieve large genomic fragments in the form of a stable “environmental DNA library.”

Another preferred type of vector for use in the present invention is a shuttle vector that is optimized for the expression of genes and gene clusters. Such systems may include but are not limited to shuttling systems that shuttle between E. coli and another bacterium such as Streptomyces. Another preferred type of vector for use in the present invention is a cosmid vector. Cosmid vectors were originally designed to clone and propagate large segments of genomic DNA. Cloning into cosmid vectors is described in detail in “Molecular Cloning: A Laboratory Manual” (Sambrook et al, 1989).

The DNA sequence in the expression vector is operatively linked to an appropriate expression control sequence(s) (promoter) to direct RNA synthesis. Particular named bacterial promoters include lac, lacZ, T3, T7, gpt, lambda P_R, P_L and trp. Eukaryotic promoters include CMV immediate early, HSV thymidine kinase, early and late SV40, LTRs from retrovirus, and mouse metallothionein-1. Selection of the appropriate vector and promoter is well within the level of ordinary skill in the art. The expression vector also contains a ribosome binding site for translation initiation and a transcription terminator. The vector may also include appropriate sequences for amplifying expression. Promoter regions can be selected from any desired gene using CAT (chloramphenicol acetyltransferase) vectors or other vectors with selectable markers.

In addition, the expression vectors preferably contain one or more selectable marker genes to provide a phenotypic trait for selection of transformed host cells such as dihydrofolate reductase or neomycin resistance for eukaryotic cell culture, or such as tetracycline or ampicillin resistance in E. coli.

Generally, recombinant expression vectors will include origins of replication and selectable markers permitting transformation of the host cell, e.g., the ampicillin resistance gene of E. coli and the S. cerevisiae TRP1 gene, and a promoter derived from a highly-expressed gene to direct transcription of a downstream structural sequence. Such promoters can be derived from operons encoding glycolytic enzymes such as 3-phosphoglycerate kinase (PGK), α-factor, acid phosphatase, or heat shock proteins, among others. The heterologous structural sequence is assembled in appropriate phase with translation initiation and termination sequences, and preferably, a leader sequence capable of directing secretion of translated protein into the periplasmic space or extracellular medium.

The cloning strategy permits expression via both vector driven and endogenous promoters; vector promotion may be important with expression of genes whose endogenous promoter will not function in E. coli.

The DNA isolated or derived from microorganisms can preferably be inserted into a vector or a plasmid prior to probing for selected DNA. Such vectors or plasmids are preferably those containing expression regulatory sequences, including promoters, enhancers and the like. Such polynucleotides can be part of a vector and/or a composition and still be isolated, in that such vector or composition is not part of its natural environment. Particularly preferred phage or plasmid and methods for introduction and packaging into them are described in detail in the protocol set forth herein.

The selection of the cloning vector depends upon the approach taken, for example, the vector can be any cloning vector with an adequate capacity to multiply repeated copies of a sequence, or multiple sequences that can be successfully transformed and selected in a host cell. One example of such a vector is described in “Polycos vectors: a system for packaging filamentous phage and phagemid vectors using lambda phage packaging extracts” (Alting-Mees and Short, 1993). Propagation/maintenance can be by an antibiotic resistance carried by the cloning vector. After a period of growth, the naturally abbreviated molecules are recovered and identified by size fractionation on a gel or column, or amplified directly. The cloning vector utilized may contain a selectable gene that is disrupted by the insertion of the lengthy construct. As reductive reassortment progresses, the number of repeated units is reduced and the interrupted gene is again expressed and hence selection for the processed construct can be applied. The vector may be an expression/selection vector which will allow for the selection of an expressed product possessing desirable biological properties. The insert may be positioned downstream of a functional promoter and the desirable property screened by appropriate means.

In vivo reassortment is focused on “inter-molecular” processes collectively referred to as “recombination”, which in bacteria is generally viewed as a “RecA-dependent” phenomenon. The present invention can rely on recombination processes of a host cell to recombine and re-assort sequences, or the cells' ability to mediate reductive processes to decrease the complexity of quasi-repeated sequences in the cell by deletion. This process of “reductive reassortment” occurs by an “intra-molecular”, RecA-independent process.

Therefore, in another aspect of the present invention, novel polynucleotides can be generated by the process of reductive reassortment. The method involves the generation of constructs containing consecutive sequences (original encoding sequences), their insertion into an appropriate vector, and their subsequent introduction into an appropriate host cell. The reassortment of the individual molecular identities occurs by combinatorial processes between the consecutive sequences in the construct possessing regions of homology, or between quasi-repeated units. The reassortment process recombines and/or reduces the complexity and extent of the repeated sequences, and results in the production of novel molecular species. Various treatments may be applied to enhance the rate of reassortment. These could include treatment with ultra-violet light, or DNA damaging chemicals, and/or the use of host cell lines displaying enhanced levels of “genetic instability”. Thus the reassortment process may involve homologous recombination or the natural property of quasi-repeated sequences to direct their own evolution.

Repeated or “quasi-repeated” sequences play a role in genetic instability. In the present invention, “quasi-repeats” are repeats that are not restricted to their original unit structure. Quasi-repeated units can be presented as an array of sequences in a construct; consecutive units of similar sequences. Once ligated, the junctions between the consecutive sequences become essentially invisible and the quasi-repetitive nature of the resulting construct is now continuous at the molecular level. The deletion process the cell performs to reduce the complexity of the resulting construct operates between the quasi-repeated sequences. The quasi-repeated units provide a practically limitless repertoire of templates upon which slippage events can occur. The constructs containing the quasi-repeats thus effectively provide sufficient molecular elasticity that deletion (and potentially insertion) events can occur virtually anywhere within the quasi-repetitive units.

When the quasi-repeated sequences are all ligated in the same orientation, for instance head to tail or vice versa, the cell cannot distinguish individual units. Consequently, the reductive process can occur throughout the sequences. In contrast, when for example, the units are presented head to head, rather than head to tail, the inversion delineates the endpoints of the adjacent unit so that deletion formation will favor the loss of discrete units. Thus, it is preferable with the present method that the sequences are in the same orientation. Random orientation of quasi-repeated sequences will result in the loss of reassortment efficiency, while consistent orientation of the sequences will offer the highest efficiency. However, while having fewer of the contiguous sequences in the same orientation decreases the efficiency, it may still provide sufficient elasticity for the effective recovery of novel molecules. Constructs can be made with the quasi-repeated sequences in the same orientation to allow higher efficiency.

Sequences can be assembled in a head to tail orientation using any of a variety of methods, including the following:

-   a) Primers that include a poly-A head and poly-T tail which when made single-stranded would provide orientation can be utilized. This is accomplished by having the first few bases of the primers made from RNA and hence easily removed by RNase H.
-   b) Primers that include unique restriction cleavage sites can be utilized. Multiple sites, a battery of unique sequences, and repeated synthesis and ligation steps would be required.
-   c) The inner few bases of the primer could be thiolated and an exonuclease used to produce properly tailed molecules.

The recovery of the re-assorted sequences relies on the identification of cloning vectors with a reduced repetitive index (RI), i.e. a reduced average number of quasi-repeated units per construct. The re-assorted encoding sequences can then be recovered by amplification. The products are re-cloned and expressed. The recovery of cloning vectors with reduced RI can be effected by:

-   1) The use of vectors only stably maintained when the construct is reduced in complexity.
-   2) The physical recovery of shortened vectors by physical procedures. In this case, the cloning vector would be recovered using standard plasmid isolation procedures and size fractionated on either an agarose gel, or column with a low molecular weight cut off utilizing standard procedures.
-   3) The recovery of vectors containing interrupted genes which can be selected when insert size decreases.
-   4) The use of direct selection techniques with an expression vector and the appropriate selection.
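Recovery by size fractionation (option 2 above) can be pictured as a simple size cutoff applied to a population of vectors. The following Python sketch is a toy illustration only; the unit length, backbone size, cutoff and starting population are hypothetical values chosen for the example and are not taken from this specification.

```python
def construct_size_bp(n_units, unit_bp=1000, backbone_bp=3000):
    """Total size of a vector carrying n_units quasi-repeated units (hypothetical sizes)."""
    return backbone_bp + n_units * unit_bp

def recover_reduced(unit_counts, cutoff_bp, unit_bp=1000, backbone_bp=3000):
    """Keep only vectors whose total size is at or below the fractionation cutoff."""
    return [n for n in unit_counts
            if construct_size_bp(n, unit_bp, backbone_bp) <= cutoff_bp]

# Hypothetical population after propagation: number of units carried by each vector.
population = [5, 5, 4, 3, 2, 2, 1, 1, 1]
recovered = recover_reduced(population, cutoff_bp=5000)

# Average number of units per vector before and after fractionation.
ri_before = sum(population) / len(population)
ri_after = sum(recovered) / len(recovered)
print(recovered, ri_before, ri_after)
```

In this sketch only vectors carrying one or two units pass the cutoff, so the average number of units per recovered vector is lower than in the unfractionated population.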

Encoding sequences (for example, genes) from related organisms may demonstrate a high degree of homology and encode quite diverse protein products. These types of sequences are particularly useful in the present invention as quasi-repeats. However, while the examples illustrated below demonstrate the reassortment of nearly identical original encoding sequences (quasi-repeats), this process is not limited to such nearly identical repeats.

The following example demonstrates the method of the invention. Encoding nucleic acid sequences (quasi-repeats) derived from three (3) unique species are depicted. Each sequence encodes a protein with a distinct set of properties. Each of the sequences differs by a single or a few base pairs at a unique position in the sequence which are designated “A”, “B” and “C”. The quasi-repeated sequences are separately or collectively amplified and ligated into random assemblies such that all possible permutations and combinations are available in the population of ligated molecules. The number of quasi-repeat units can be controlled by the assembly conditions. The average number of quasi-repeated units in a construct is defined as the repetitive index (RI).
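The definition of the repetitive index can be made concrete with a small calculation. The Python sketch below is a toy model, not a description of an actual ligation: the uniform distribution of construct lengths, the population size and the function names are assumptions made for illustration.

```python
import random

def assemble_constructs(units, n_constructs, max_units, seed=0):
    """Ligate quasi-repeat units into random assemblies (toy model)."""
    rng = random.Random(seed)
    constructs = []
    for _ in range(n_constructs):
        length = rng.randint(1, max_units)  # assumed uniform length distribution
        constructs.append([rng.choice(units) for _ in range(length)])
    return constructs

def repetitive_index(constructs):
    """Average number of quasi-repeated units per construct (the RI)."""
    return sum(len(construct) for construct in constructs) / len(constructs)

# Units "A", "B" and "C" as in the example above; the other values are hypothetical.
population = assemble_constructs(["A", "B", "C"], n_constructs=1000, max_units=6)
ri = repetitive_index(population)
print(round(ri, 2))
```

Changing `max_units` plays the role of changing the assembly conditions: it shifts the distribution of construct lengths and therefore the RI.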

Once formed, the constructs may or may not be size fractionated on an agarose gel according to published protocols, inserted into a cloning vector, and transfected into an appropriate host cell. The cells are then propagated and “reductive reassortment” is effected. The rate of the reductive reassortment process may be stimulated by the introduction of DNA damage if desired. Whether the reduction in RI is mediated by deletion formation between repeated sequences by an “intra-molecular” mechanism, or mediated by recombination-like events through “inter-molecular” mechanisms is immaterial. The end result is a reassortment of the molecules into all possible combinations.
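The reduction in RI during propagation can be sketched as a simple simulation. The Python example below is a deliberately simplified model: deletions are assumed to remove whole units, the per-generation deletion probability is an arbitrary illustrative value, and inter-molecular events are not modeled.

```python
import random

def reduce_construct(construct, rng, deletion_probability):
    """Delete one contiguous run of units, always leaving at least one unit."""
    if len(construct) < 2 or rng.random() >= deletion_probability:
        return construct
    run = rng.randint(1, len(construct) - 1)      # number of units removed
    start = rng.randint(0, len(construct) - run)  # index where the deletion begins
    return construct[:start] + construct[start + run:]

def propagate(constructs, generations, deletion_probability=0.2, seed=0):
    """Propagate constructs and record the RI after each generation."""
    rng = random.Random(seed)
    ri_history = [sum(map(len, constructs)) / len(constructs)]
    for _ in range(generations):
        constructs = [reduce_construct(c, rng, deletion_probability) for c in constructs]
        ri_history.append(sum(map(len, constructs)) / len(constructs))
    return constructs, ri_history

# Hypothetical starting population: 200 constructs of five units each.
start_rng = random.Random(1)
starting = [[start_rng.choice("ABC") for _ in range(5)] for _ in range(200)]
final, ri_history = propagate(starting, generations=30)
print(ri_history[0], ri_history[-1])
```

Because each deletion joins units that were not adjacent in the starting construct, the shortened molecules carry new combinations of "A", "B" and "C" while the RI falls.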

Optionally, the method comprises the additional step of screening the library members of the shuffled pool to identify individual shuffled library members having the ability to bind or otherwise interact (e.g., such as catalytic antibodies) with a predetermined macromolecule, such as for example a proteinaceous receptor, peptide, oligosaccharide, virion, or other predetermined compound or structure.

The displayed polypeptides, antibodies, peptidomimetic antibodies, and variable region sequences that are identified from such libraries can be used for therapeutic, diagnostic, research and related purposes (e.g., catalysts, solutes for increasing osmolarity of an aqueous solution, and the like), and/or can be subjected to one or more additional cycles of shuffling and/or affinity selection. The method can be modified such that the step of selecting for a phenotypic characteristic can be other than of binding affinity for a predetermined molecule (e.g., for catalytic activity, stability, oxidation resistance, drug resistance, or detectable phenotype conferred upon a host cell).

The present invention provides a method for generating libraries of displayed antibodies suitable for affinity interactions screening. The method comprises (1) obtaining a first plurality of selected library members comprising a displayed antibody and an associated polynucleotide encoding said displayed antibody, and obtaining said associated polynucleotides or copies thereof, wherein said associated polynucleotides comprise a region of substantially identical variable region framework sequence, and (2) introducing said polynucleotides into a suitable host cell and growing the cells under conditions which promote recombination and reductive reassortment resulting in shuffled polynucleotides. CDR combinations comprised by the shuffled pool are not present in the first plurality of selected library members, said shuffled pool composing a library of displayed antibodies comprising CDR permutations and suitable for affinity interaction screening. Optionally, the shuffled pool is subjected to affinity screening to select shuffled library members which bind to a predetermined epitope (antigen), thereby selecting a plurality of selected shuffled library members. Further, the plurality of selectively shuffled library members can be shuffled and screened iteratively, from 1 to about 1000 cycles or as desired until library members having a desired binding affinity are obtained.
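The iterative shuffle-and-screen cycle can be sketched as a loop that stops when a desired affinity is reached or a cycle limit is hit. The Python example below is a toy model only: the CDR names, their per-CDR affinity contributions, the additive affinity score and the pool sizes are all hypothetical and do not describe measured antibody data.

```python
import random

def affinity(member, scores):
    """Toy affinity: sum of hypothetical per-CDR contributions."""
    return sum(scores[cdr] for cdr in member)

def shuffle_pool(selected, pool_size, rng):
    """Build new members by drawing each CDR position from a randomly chosen parent."""
    positions = len(selected[0])
    return [tuple(rng.choice(selected)[i] for i in range(positions))
            for _ in range(pool_size)]

def evolve(library, scores, target, keep=3, pool_size=50, max_cycles=1000, seed=0):
    """Shuffle and screen until the target affinity is reached or max_cycles is hit."""
    rng = random.Random(seed)
    rank = lambda member: (affinity(member, scores), member)  # total order, deterministic
    selected = sorted(set(library), key=rank, reverse=True)[:keep]
    for cycle in range(1, max_cycles + 1):
        pool = shuffle_pool(selected, pool_size, rng)
        selected = sorted(set(pool) | set(selected), key=rank, reverse=True)[:keep]
        if affinity(selected[0], scores) >= target:
            return selected[0], cycle
    return selected[0], max_cycles

# Hypothetical CDR variants; each starting member carries one favorable CDR.
scores = {"H1a": 1, "H1b": 3, "H2a": 2, "H2b": 5, "H3a": 1, "H3b": 4}
library = [("H1b", "H2a", "H3a"), ("H1a", "H2b", "H3a"), ("H1a", "H2a", "H3b")]
best, cycles_used = evolve(library, scores, target=12)
print(best, cycles_used)
```

The best member found combines CDRs from all three starting members, a combination that is absent from the starting library, which mirrors the statement that CDR combinations in the shuffled pool are not present in the first plurality of selected members.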

In another aspect of the invention, it is envisioned that prior to or during recombination or reassortment, polynucleotides generated by the method of the present invention can be subjected to agents or processes which promote the introduction of mutations into the original polynucleotides. The introduction of such mutations would increase the diversity of resulting hybrid polynucleotides and polypeptides encoded therefrom. The agents or processes which promote mutagenesis can include, but are not limited to: (+)-CC-1065, or a synthetic analog such as (+)-CC-1065-(N-3-Adenine) (see Sun and Hurley, 1992); an N-acetylated or deacetylated 4′-fluoro-4-aminobiphenyl adduct capable of inhibiting DNA synthesis (see, for example, van de Poll et al, 1992); or an N-acetylated or deacetylated 4-aminobiphenyl adduct capable of inhibiting DNA synthesis (see also, van de Poll et al, 1992, pp. 751-758); trivalent chromium, a trivalent chromium salt, a polycyclic aromatic hydrocarbon (“PAH”) DNA adduct capable of inhibiting DNA replication, such as 7-bromomethyl-benz[a]anthracene (“BMA”), tris(2,3-dibromopropyl)phosphate (“Tris-BP”), 1,2-dibromo-3-chloropropane (“DBCP”), 2-bromoacrolein (2BA), benzo[a]pyrene-7,8-dihydrodiol-9-10-epoxide (“BPDE”), a platinum(II) halogen salt, N-hydroxy-2-amino-3-methylimidazo[4,5-f]-quinoline (“N-hydroxy-IQ”), and N-hydroxy-2-amino-1-methyl-6-phenylimidazo[4,5-f]-pyridine (“N-hydroxy-PhIP”). Especially preferred “means for slowing or halting PCR amplification” consist of UV light, (+)-CC-1065 and (+)-CC-1065-(N-3-Adenine). Particularly encompassed means are DNA adducts or polynucleotides comprising the DNA adducts from the polynucleotides or polynucleotides pool, which can be released or removed by a process including heating the solution comprising the polynucleotides prior to further processing.

In another aspect, this invention provides for using UV light to mutagenize polynucleotides. One use of such a technique is as follows: one microgram samples of template DNA are obtained and treated with U.V. light to cause the formation of dimers, particularly pyrimidine dimers, including TT dimers. U.V. exposure is limited so that only a few photoproducts are generated per gene on the template DNA sample. Multiple samples are treated with U.V. light for varying periods of time to obtain template DNA samples with varying numbers of dimers from U.V. exposure. A random priming kit which utilizes a non-proofreading polymerase (for example, Prime-It II Random Primer Labeling kit by Stratagene Cloning Systems) is utilized to generate different size polynucleotides by priming at random sites on templates which are prepared by U.V. light (as described above) and extending along the templates. The priming protocols such as described in the Prime-It II Random Primer Labeling kit may be utilized to extend the primers. The dimers formed by U.V. exposure serve as a roadblock for the extension by the non-proofreading polymerase. Thus, a pool of random size polynucleotides is present after extension with the random primers is finished.

In another aspect the present invention is directed to a method of producing recombinant proteins having biological activity by treating a sample comprising double-stranded template polynucleotides encoding a wild-type protein under conditions according to the present invention which provide for the production of hybrid or re-assorted polynucleotides.

The invention also provides the use of polynucleotide shuffling to shuffle a population of viral genes (e.g., capsid proteins, spike glycoproteins, polymerases, and proteases) or viral genomes (e.g., paramyxoviridae, orthomyxoviridae, herpesviruses, retroviruses, reoviruses and rhinoviruses). In an embodiment, the invention provides a method for shuffling sequences encoding all or portions of immunogenic viral proteins to generate novel combinations of epitopes as well as novel epitopes created by recombination; such shuffled viral proteins may comprise epitopes or combinations of epitopes which are likely to arise in the natural environment as a consequence of viral evolution (e.g., recombination of influenza virus strains).

The invention also provides a method suitable for shuffling polynucleotide sequences for generating gene therapy vectors and replication-defective gene therapy constructs, such as may be used for human gene therapy, including but not limited to vaccination vectors for DNA-based vaccination, as well as anti-neoplastic gene therapy and other general therapy formats.

In the polypeptide notation used herein, the left-hand direction is the amino terminal direction and the right-hand direction is the carboxy-terminal direction, in accordance with standard usage and convention. Similarly, unless specified otherwise, the left-hand end of single-stranded polynucleotide sequences is the 5′ end; the left-hand direction of double-stranded polynucleotide sequences is referred to as the 5′ direction. The direction of 5′ to 3′ addition of nascent RNA transcripts is referred to as the transcription direction; sequence regions on the DNA strand having the same sequence as the RNA and which are 5′ to the 5′ end of the RNA transcript are referred to as “upstream sequences”; sequence regions on the DNA strand having the same sequence as the RNA and which are 3′ to the 3′ end of the coding RNA transcript are referred to as “downstream sequences”.

3.1. Saturation Mutagenesis

In one aspect, this invention provides for the use of proprietary codon primers (containing a degenerate N,N,G/T sequence) to introduce point mutations into a polynucleotide, so as to generate a set of progeny polypeptides in which a full range of single amino acid substitutions is represented at each amino acid position. The oligos used are comprised contiguously of a first homologous sequence, a degenerate N,N,G/T sequence, and preferably but not necessarily a second homologous sequence. The downstream progeny translational products from the use of such oligos include all possible amino acid changes at each amino acid site along the polypeptide, because the degeneracy of the N,N,G/T sequence includes codons for all 20 amino acids.

In one aspect, one such degenerate oligo (comprised of one degenerate N,N,G/T cassette) is used for subjecting each original codon in a parental polynucleotide template to a full range of codon substitutions. In another aspect, at least two degenerate N,N,G/T cassettes are used, either in the same oligo or not, for subjecting at least two original codons in a parental polynucleotide template to a full range of codon substitutions. Thus, more than one N,N,G/T sequence can be contained in one oligo to introduce amino acid mutations at more than one site. This plurality of N,N,G/T sequences can be directly contiguous, or separated by one or more additional nucleotide sequence(s). In another aspect, oligos serviceable for introducing additions and deletions can be used either alone or in combination with the codons containing an N,N,G/T sequence, to introduce any combination or permutation of amino acid additions, deletions, and/or substitutions.

In a particular exemplification, it is possible to simultaneously mutagenize two or more contiguous amino acid positions using an oligo that contains contiguous N,N,G/T triplets, i.e. a degenerate (N,N,G/T)_(n) sequence.
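The oligo layout described above (a first homologous sequence, one or more contiguous N,N,G/T triplets, and an optional second homologous sequence) can be illustrated with a minimal Python sketch. The function name `design_nnk_oligo`, the `flank` parameter, and the example template are hypothetical and chosen only for illustration; the IUPAC symbol `K` is used to denote G/T.

```python
def design_nnk_oligo(template: str, codon_index: int,
                     n_codons: int = 1, flank: int = 15) -> str:
    """Build an illustrative degenerate oligo (hypothetical helper).

    Layout: first homologous sequence + (NNK) x n_codons + second
    homologous sequence, where K is the IUPAC code for G/T.
    codon_index is 0-based and counts codons from the start of template.
    """
    start = codon_index * 3
    end = start + 3 * n_codons
    if start < 0 or end > len(template):
        raise ValueError("codon window falls outside the template")
    left = template[max(0, start - flank):start]   # first homologous sequence
    right = template[end:end + flank]              # second homologous sequence
    return left + "NNK" * n_codons + right


# Hypothetical 30-nt coding sequence used only as an example.
template = "ATGGCTAGCAAAGGTGAACTGTTCACCGGT"
single = design_nnk_oligo(template, codon_index=3, flank=6)
double = design_nnk_oligo(template, codon_index=3, n_codons=2, flank=6)
print(single)
print(double)
```

Setting `n_codons` above 1 corresponds to the contiguous (N,N,G/T)_(n) case; real primer design would also consider melting temperature and strand orientation, which this sketch ignores.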

In another aspect, the present invention provides for the use of degenerate cassettes having less degeneracy than the N,N,G/T sequence. For example, it may be desirable in some instances to use (e.g. in an oligo) a degenerate triplet sequence comprised of only one N, where said N can be in the first, second, or third position of the triplet. Any other bases, including any combinations and permutations thereof, can be used in the remaining two positions of the triplet. Alternatively, it may be desirable in some instances to use (e.g. in an oligo) a degenerate N,N,N triplet sequence, or an N,N,G/C triplet sequence.

It is appreciated, however, that the use of a degenerate triplet (such as N,N,G/T or an N,N,G/C triplet sequence) as disclosed in the instant invention is advantageous for several reasons. In one aspect, this invention provides a means to systematically and fairly easily generate the substitution of the full range of possible amino acids (for a total of 20 amino acids) into each and every amino acid position in a polypeptide. Thus, for a 100 amino acid polypeptide, the instant invention provides a way to systematically and fairly easily generate 2000 distinct species (i.e. 20 possible amino acids per position × 100 amino acid positions). It is appreciated that there is provided, through the use of an oligo containing a degenerate N,N,G/T or an N,N,G/C triplet sequence, 32 individual sequences that code for 20 possible amino acids. Thus, in a reaction vessel in which a parental polynucleotide sequence is subjected to saturation mutagenesis using one such oligo, there are generated 32 distinct progeny polynucleotides encoding 20 distinct polypeptides. In contrast, the use of a non-degenerate oligo in site-directed mutagenesis leads to only one progeny polypeptide product per reaction vessel.
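The counts above can be checked with a short Python sketch that expands a degenerate triplet against the standard genetic code. The helper names `expand` and `summarize` are illustrative only. Under the standard code, the enumeration also shows that one of the 32 N,N,G/T (or N,N,G/C) codons is the stop codon TAG, so the 32 sequences cover the 20 amino acids plus one stop.

```python
from itertools import product

# Standard genetic code, bases ordered T, C, A, G at each codon position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def expand(first: str, second: str, third: str) -> list:
    """List every codon allowed by the bases given for each position."""
    return ["".join(c) for c in product(first, second, third)]


def summarize(codons: list) -> tuple:
    """Return (codon count, distinct amino acids encoded, stop codons)."""
    amino_acids = {CODON_TABLE[c] for c in codons} - {"*"}
    stops = sorted(c for c in codons if CODON_TABLE[c] == "*")
    return len(codons), len(amino_acids), stops


nnk = summarize(expand("ACGT", "ACGT", "GT"))    # N,N,G/T
nns = summarize(expand("ACGT", "ACGT", "CG"))    # N,N,G/C
nnn = summarize(expand("ACGT", "ACGT", "ACGT"))  # N,N,N
species_for_100_residues = 20 * 100              # 20 amino acids x 100 positions
print(nnk, nns, nnn, species_for_100_residues)
```

Compared with N,N,N, the N,N,G/T and N,N,G/C triplets halve the number of codons to screen while removing two of the three stop codons.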

This invention also provides for the use of nondegenerate oligos, which can optionally be used in combination with the degenerate primers disclosed. It is appreciated that in some situations, it is advantageous to use nondegenerate oligos to generate specific point mutations in a working polynucleotide. This provides a means to generate specific silent point mutations, point mutations leading to corresponding amino acid changes, and point mutations that cause the generation of stop codons and the corresponding expression of polypeptide fragments.

Thus, in a preferred embodiment of this invention, each saturation mutagenesis reaction vessel contains polynucleotides encoding at least 20 progeny polypeptide molecules such that all 20 amino acids are represented at the one specific amino acid position corresponding to the codon position mutagenized in the parental polynucleotide. The 32-fold degenerate progeny polypeptides generated from each saturation mutagenesis reaction vessel can be subjected to clonal amplification (e.g. cloned into a suitable E. coli host using an expression vector) and subjected to expression screening. When an individual progeny polypeptide is identified by screening to display a favorable change in property (when compared to the parental polypeptide), it can be sequenced to identify the correspondingly favorable amino acid substitution contained therein.

It is appreciated that upon mutagenizing each and every amino acid position in a parental polypeptide using saturation mutagenesis as disclosed herein, favorable amino acid changes may be identified at more than one amino acid position. One or more new progeny molecules can be generated that contain a combination of all or part of these favorable amino acid substitutions. For example, if 2 specific favorable amino acid changes are identified in each of 3 amino acid positions in a polypeptide, the permutations include 3 possibilities at each position (no change from the original amino acid, and each of two favorable changes) and 3 positions. Thus, there are 3×3×3 or 27 total possibilities, including 7 that were previously examined: 6 single point mutations (i.e. 2 at each of three positions) and no change at any position.
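The 27-versus-7 arithmetic above can be reproduced with a small Python enumeration. The residues below are hypothetical placeholders (the first entry at each position stands for the original amino acid, and the other two for the favorable changes); only the counting logic follows the text.

```python
from itertools import product

# Hypothetical example: original residue first, then two favorable changes.
options_per_position = [
    ["A", "V", "I"],   # position 1
    ["L", "F", "M"],   # position 2
    ["K", "R", "H"],   # position 3
]
original = tuple(options[0] for options in options_per_position)

# Every combination of "no change" or one favorable change per position.
variants = list(product(*options_per_position))


def mutation_count(variant: tuple) -> int:
    """Number of positions at which a variant differs from the original."""
    return sum(1 for res, orig in zip(variant, original) if res != orig)


# Already examined: the unchanged parent plus the six single point mutants.
previously_examined = [v for v in variants if mutation_count(v) <= 1]
new_combinations = [v for v in variants if mutation_count(v) >= 2]

print(len(variants), len(previously_examined), len(new_combinations))
```

The 20 remaining combinations are the double and triple mutants that were not covered by the single-position screens.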

In yet another aspect, site-saturation mutagenesis can be used together with shuffling, chimerization, recombination and other mutagenizing processes, along with screening. This invention provides for the use of any mutagenizing process(es), including saturation mutagenesis, in an iterative manner. In one exemplification, the iterative use of any mutagenizing process(es) is used in combination with screening.

Thus, in a non-limiting exemplification, this invention provides for the use of saturation mutagenesis in combination with additional mutagenization processes, such as a process where two or more related polynucleotides are introduced into a suitable host cell such that a hybrid polynucleotide is generated by recombination and reductive reassortment.

In addition to performing mutagenesis along the entire sequence of a gene, the instant invention provides that mutagenesis can be used to replace each of any number of bases in a polynucleotide sequence, wherein the number of bases to be mutagenized is preferably every integer from 15 to 100,000. Thus, instead of mutagenizing every position along a molecule, one can subject a discrete number of bases (preferably a subset totaling from 15 to 100,000) to mutagenesis. Preferably, a separate nucleotide is used for mutagenizing each position or group of positions along a polynucleotide sequence. A group of 3 positions to be mutagenized may be a codon. The mutations are preferably introduced using a mutagenic primer containing a heterologous cassette, also referred to as a mutagenic cassette. Preferred cassettes can have from 1 to 500 bases. Each nucleotide position in such heterologous cassettes can be N, A, C, G, T, A/C, A/G, A/T, C/G, C/T, G/T, C/G/T, A/G/T, A/C/T, A/C/G, or E, where E is any base that is not A, C, G, or T (E can be referred to as a designer oligo). The tables below show exemplary tri-nucleotide cassettes (there are over 3000 possibilities in addition to N,N,G/T and N,N,N and N,N,A/C).
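The "over 3000 possibilities" figure is consistent with a simple count of tri-nucleotide cassettes built from the per-position options listed above. The Python sketch below performs that count under one assumption that is not stated in the text: the designer base E is left out of the main tally and added only in a separate figure.

```python
from itertools import product

# The 15 per-position options listed in the text, excluding designer base E
# (excluding E here is an assumption made for this illustration).
SYMBOLS = ["N", "A", "C", "G", "T",
           "A/C", "A/G", "A/T", "C/G", "C/T", "G/T",
           "C/G/T", "A/G/T", "A/C/T", "A/C/G"]

triplets = list(product(SYMBOLS, repeat=3))

# The three cassettes named explicitly in the text.
named = {("N", "N", "G/T"), ("N", "N", "N"), ("N", "N", "A/C")}
others = [t for t in triplets if t not in named]

# If E is allowed at every position as well, there are 16 options per position.
with_designer_base = (len(SYMBOLS) + 1) ** 3

print(len(triplets), len(others), with_designer_base)
```

With 15 options per position there are 3375 triplet cassettes, of which 3372 remain after setting aside the three named ones; allowing E at each position raises the total to 4096.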

In a general sense, saturation mutagenesis is comprised of mutagenizing a complete set of mutagenic cassettes (wherein each cassette is preferably 1-500 bases in length) in a defined polynucleotide sequence to be mutagenized (wherein the sequence to be mutagenized is preferably from 15 to 100,000 bases in length). Thus, a group of mutations (ranging from 1 to 100 mutations) is introduced into each cassette to be mutagenized. A grouping of mutations to be introduced into one cassette can be different from or the same as a second grouping of mutations to be introduced into a second cassette during the application of one round of saturation mutagenesis. Such groupings are exemplified by deletions, additions, groupings of particular codons, and groupings of particular nucleotide cassettes.

Defined sequences to be mutagenized (see FIG. 20) preferably include a whole gene, pathway, cDNA, an entire open reading frame (ORF), and an entire promoter, enhancer, repressor/transactivator, origin of replication, intron, operator, or any polynucleotide functional group. Generally, a preferred “defined sequence” for this purpose may be any polynucleotide that is a 15-base polynucleotide sequence, or a polynucleotide sequence of a length between 15 bases and 15,000 bases (this invention specifically names every integer in between). Considerations in choosing groupings of codons include types of amino acids encoded by a degenerate mutagenic cassette.

In a particularly preferred exemplification of a grouping of mutations that can be introduced into a mutagenic cassette (see Tables 1-85), this invention specifically provides for degenerate codon substitutions (using degenerate oligos) that code for 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, and 20 amino acids at each position, and a library of polypeptides encoded thereby.

3.2. Chimerizations

3.2.1 “Shuffling”

Nucleic acid shuffling is a method for in vitro or in vivo homologous recombination of pools of shorter or smaller polynucleotides to produce a polynucleotide or polynucleotides. Mixtures of related nucleic acid sequences or polynucleotides are subjected to sexual PCR to provide random polynucleotides, and reassembled to yield a library or mixed population of recombinant hybrid nucleic acid molecules or polynucleotides.

In contrast to cassette mutagenesis, only shuffling and error-prone PCR allow one to mutate a pool of sequences blindly (without sequence information other than primers).

The advantage of the mutagenic shuffling of this invention over error-prone PCR alone for repeated selection can best be explained with an example from antibody engineering. Consider DNA shuffling as compared with error-prone PCR (not sexual PCR). The initial library of selected pooled sequences can consist of related sequences of diverse origin (e.g., antibodies from naive mRNA) or can be derived by any type of mutagenesis (including shuffling) of a single antibody gene. A collection of selected complementarity determining regions (“CDRs”) is obtained after the first round of affinity selection. In the diagram the thick CDRs confer onto the antibody molecule increased affinity for the antigen. Shuffling allows the free combinatorial association of all of the CDR1s with all of the CDR2s with all of the CDR3s, for example.

This method differs from error-prone PCR in that it is an inverse chain reaction. In error-prone PCR, the number of polymerase start sites and the number of molecules grow exponentially. However, the sequence of the polymerase start sites and the sequence of the molecules remain essentially the same. In contrast, in nucleic acid reassembly or shuffling of random polynucleotides the number of start sites and the number (but not size) of the random polynucleotides decrease over time. For polynucleotides derived from whole plasmids the theoretical endpoint is a single, large concatemeric molecule.

Since cross-overs occur at regions of homology, recombination will primarily occur between members of the same sequence family. This discourages combinations of CDRs that are grossly incompatible (e.g., directed against different epitopes of the same antigen). It is contemplated that multiple families of sequences can be shuffled in the same reaction. Further, shuffling generally conserves the relative order, such that, for example, CDR1 will not be found in the position of CDR2.

Rare shufflants will contain a large number of the best (e.g., highest affinity) CDRs and these rare shufflants may be selected based on their superior affinity.

CDRs from a pool of 100 different selected antibody sequences can be permutated in up to 100^6 (that is, 10^12) different ways. This large number of permutations cannot be represented in a single library of DNA sequences. Accordingly, it is contemplated that multiple cycles of DNA shuffling and selection may be required depending on the length of the sequence and the sequence diversity desired.

Error-prone PCR, in contrast, keeps all the selected CDRs in the same relative sequence, generating a much smaller mutant cloud.

The template polynucleotide which may be used in the methods of this invention may be DNA or RNA. It may be of various lengths depending on the size of the gene or shorter or smaller polynucleotide to be recombined or reassembled. Preferably, the template polynucleotide is from 50 bp to 50 kb. It is contemplated that entire vectors containing the nucleic acid encoding the protein of interest can be used in the methods of this invention, and in fact have been successfully used.

The template polynucleotide may be obtained by amplification using the PCR reaction (U.S. Pat. No. 4,683,202 and U.S. Pat. No. 4,683,195) or other amplification or cloning methods. However, the removal of free primers from the PCR products before subjecting them to pooling of the PCR products and sexual PCR may provide more efficient results. Failure to adequately remove the primers from the original pool before sexual PCR can lead to a low frequency of crossover clones.

The template polynucleotide often should be double-stranded. A double-stranded nucleic acid molecule is recommended to ensure that regions of the resulting single-stranded polynucleotides are complementary to each other and thus can hybridize to form a double-stranded molecule.

It is contemplated that single-stranded or double-stranded nucleic acid polynucleotides having regions of identity to the template polynucleotide and regions of heterology to the template polynucleotide may be added to the template polynucleotide at this step. It is also contemplated that two different but related polynucleotide templates can be mixed at this step.

The double-stranded polynucleotide template and any added double- or single-stranded polynucleotides are subjected to sexual PCR, which includes slowing or halting, to provide a mixture of polynucleotides of from about 5 bp to 5 kb or more. Preferably the size of the random polynucleotides is from about 10 bp to 1000 bp, more preferably the size of the polynucleotides is from about 20 bp to 500 bp.

Alternatively, it is also contemplated that double-stranded nucleic acid having multiple nicks may be used in the methods of this invention. A nick is a break in one strand of the double-stranded nucleic acid. The distance between such nicks is preferably 5 bp to 5 kb, more preferably between 10 bp and 1000 bp. This can provide areas of self-priming to produce shorter or smaller polynucleotides to be included with the polynucleotides resulting from random primers, for example.

The concentration of any one specific polynucleotide will not be greater than 1% by weight of the total polynucleotides, more preferably the concentration of any one specific nucleic acid sequence will not be greater than 0.1% by weight of the total nucleic acid.

The number of different specific polynucleotides in the mixture will be at least about 100, preferably at least about 500, and more preferably at least about 1000.

At this step single-stranded or double-stranded polynucleotides, either synthetic or natural, may be added to the random double-stranded shorter or smaller polynucleotides in order to increase the heterogeneity of the mixture of polynucleotides.

It is also contemplated that populations of double-stranded randomly broken polynucleotides may be mixed or combined at this step with the polynucleotides from the sexual PCR process and optionally subjected to one or more additional sexual PCR cycles.

Where insertion of mutations into the template polynucleotide is desired, single-stranded or double-stranded polynucleotides having a region of identity to the template polynucleotide and a region of heterology to the template polynucleotide may be added in a 20 fold excess by weight as compared to the total nucleic acid, more preferably the single-stranded polynucleotides may be added in a 10 fold excess by weight as compared to the total nucleic acid.

Where a mixture of different but related template polynucleotides is desired, populations of polynucleotides from each of the templates may be combined at a ratio of less than about 1:100, more preferably the ratio is less than about 1:40. For example, a backcross of the wild-type polynucleotide with a population of mutated polynucleotide may be desired to eliminate neutral mutations (e.g., mutations yielding an insubstantial alteration in the phenotypic property being selected for). In such an example, the ratio of randomly provided wild-type polynucleotides which may be added to the randomly provided sexual PCR cycle hybrid polynucleotides is approximately 1:1 to about 100:1, and more preferably from 1:1 to 40:1.

The mixed population of random polynucleotides is denatured to form single-stranded polynucleotides and then re-annealed. Only those single-stranded polynucleotides having regions of homology with other single-stranded polynucleotides will re-anneal.

The random polynucleotides may be denatured by heating. One skilled in the art could determine the conditions necessary to completely denature the double-stranded nucleic acid. Preferably the temperature is from 80° C. to 100° C., more preferably the temperature is from 90° C. to 96° C. Other methods which may be used to denature the polynucleotides include pressure (36) and pH.

The polynucleotides may be re-annealed by cooling. Preferably the temperature is from 20° C. to 75° C., more preferably the temperature is from 40° C. to 65° C. If a high frequency of crossovers is needed based on an average of only 4 consecutive bases of homology, recombination can be forced by using a low annealing temperature, although the process becomes more difficult. The degree of renaturation which occurs will depend on the degree of homology between the population of single-stranded polynucleotides.

Renaturation can be accelerated by the addition of polyethylene glycol (“PEG”) or salt. The salt concentration is preferably from 0 mM to 200 mM, more preferably the salt concentration is from 10 mM to 100 mM. The salt may be KCl or NaCl. The concentration of PEG is preferably from 0% to 20%, more preferably from 5% to 10%.

The annealed polynucleotides are next incubated in the presence of a nucleic acid polymerase and dNTPs (i.e. dATP, dCTP, dGTP and dTTP). The nucleic acid polymerase may be the Klenow fragment, the Taq polymerase or any other DNA polymerase known in the art.

The approach to be used for the assembly depends on the minimum degree of homology that should still yield crossovers. If the areas of identity are large, Taq polymerase can be used with an annealing temperature of between 45-65° C. If the areas of identity are small, Klenow polymerase can be used with an annealing temperature of between 20-30° C. One skilled in the art could vary the temperature of annealing to increase the number of crossovers achieved.

The polymerase may be added to the random polynucleotides prior to annealing, simultaneously with annealing or after annealing.

The cycle of denaturation, renaturation and incubation in the presence of polymerase is referred to herein as shuffling or reassembly of the nucleic acid. This cycle is repeated for a desired number of times. Preferably the cycle is repeated from 2 to 50 times, more preferably the cycle is repeated from 10 to 40 times.
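The preferred and more preferred ranges stated above for denaturation temperature, re-annealing temperature, salt, PEG, and cycle number can be collected into a small lookup, as in the following Python sketch. The class `ReassemblyConditions`, the function `classify`, and the field names are hypothetical; only the numeric ranges are taken from the text, and the sketch is a bookkeeping aid, not a protocol.

```python
from dataclasses import dataclass

# (preferred range, more preferred range) for each parameter, as stated
# in the text above. Field names are illustrative.
RANGES = {
    "denature_c": ((80, 100), (90, 96)),   # denaturation temperature, deg C
    "anneal_c": ((20, 75), (40, 65)),      # re-annealing temperature, deg C
    "salt_mm": ((0, 200), (10, 100)),      # KCl or NaCl concentration, mM
    "peg_percent": ((0, 20), (5, 10)),     # PEG concentration, percent
    "cycles": ((2, 50), (10, 40)),         # number of reassembly cycles
}


@dataclass
class ReassemblyConditions:
    denature_c: float
    anneal_c: float
    salt_mm: float
    peg_percent: float
    cycles: int


def classify(conditions: ReassemblyConditions) -> dict:
    """Label each parameter by the narrowest stated range that contains it."""
    report = {}
    for name, (preferred, more_preferred) in RANGES.items():
        value = getattr(conditions, name)
        if more_preferred[0] <= value <= more_preferred[1]:
            report[name] = "more preferred"
        elif preferred[0] <= value <= preferred[1]:
            report[name] = "preferred"
        else:
            report[name] = "outside stated range"
    return report


example = ReassemblyConditions(denature_c=94, anneal_c=25, salt_mm=50,
                               peg_percent=0, cycles=60)
report = classify(example)
print(report)
```

In this example the 25° C. annealing temperature falls in the preferred but not the more preferred range, which matches the low-temperature Klenow case described above, while 60 cycles lies outside the stated 2 to 50 range.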

The resulting nucleic acid is a larger double-stranded polynucleotide of from about 50 bp to about 100 kb, preferably the larger polynucleotide is from 500 bp to 50 kb.

This larger polynucleotide may contain a number of copies of a polynucleotide having the same size as the template polynucleotide in tandem. This concatemeric polynucleotide is then denatured into single copies of the template polynucleotide. The result will be a population of polynucleotides of approximately the same size as the template polynucleotide. The population will be a mixed population where single or double-stranded polynucleotides having an area of identity and an area of heterology have been added to the template polynucleotide prior to shuffling. These polynucleotides are then cloned into the appropriate vector and the ligation mixture used to transform bacteria.

It is contemplated that the single polynucleotides may be obtained from the larger concatemeric polynucleotide by amplification of the single polynucleotide prior to cloning by a variety of methods including PCR (U.S. Pat. No. 4,683,195 and U.S. Pat. No. 4,683,202), rather than by digestion of the concatemer.

The vector used for cloning is not critical provided that it will accept a polynucleotide of the desired size. If expression of the particular polynucleotide is desired, the cloning vehicle should further comprise transcription and translation signals next to the site of insertion of the polynucleotide to allow expression of the polynucleotide in the host cell. Preferred vectors include the pUC series and the pBR series of plasmids.

The resulting bacterial population will include a number of recombinant polynucleotides having random mutations. This mixed population may be tested to identify the desired recombinant polynucleotides. The method of selection will depend on the polynucleotide desired.

For example, if a polynucleotide which encodes a protein with increased binding efficiency to a ligand is desired, the proteins expressed by each of the portions of the polynucleotides in the population or library may be tested for their ability to bind to the ligand by methods known in the art (e.g., panning, affinity chromatography). If a polynucleotide which encodes a protein with increased drug resistance is desired, the proteins expressed by each of the polynucleotides in the population or library may be tested for their ability to confer drug resistance to the host organism. One skilled in the art, given knowledge of the desired protein, could readily test the population to identify polynucleotides which confer the desired properties onto the protein.

It is contemplated that one skilled in the art could use a phage display system in which fragments of the protein are expressed as fusion proteins on the phage surface (Pharmacia, Milwaukee Wis.). The recombinant DNA molecules are cloned into the phage DNA at a site which results in the transcription of a fusion protein, a portion of which is encoded by the recombinant DNA molecule. The phage containing the recombinant nucleic acid molecule undergoes replication and transcription in the cell. The leader sequence of the fusion protein directs the transport of the fusion protein to the tip of the phage particle. Thus the fusion protein which is partially encoded by the recombinant DNA molecule is displayed on the phage particle for detection and selection by the methods described above.

It is further contemplated that a number of cycles of nucleic acid shuffling may be conducted with polynucleotides from a sub-population of the first population, which sub-population contains DNA encoding the desired recombinant protein. In this manner, proteins with even higher binding affinities or enzymatic activity could be achieved.

It is also contemplated that a number of cycles of nucleic acid shuffling may be conducted with a mixture of wild-type polynucleotides and a sub-population of nucleic acid from the first or subsequent rounds of nucleic acid shuffling in order to remove any silent mutations from the sub-population.

Any source of nucleic acid, in purified form, can be utilized as the starting nucleic acid. Thus the process may employ DNA or RNA including messenger RNA, which DNA or RNA may be single or double stranded. In addition, a DNA-RNA hybrid which contains one strand of each may be utilized. The nucleic acid sequence may be of various lengths depending on the size of the nucleic acid sequence to be mutated. Preferably the specific nucleic acid sequence is from 50 to 50000 base pairs. It is contemplated that entire vectors containing the nucleic acid encoding the protein of interest may be used in the methods of this invention.

The nucleic acid may be obtained from any source, for example, from plasmids such as pBR322, from cloned DNA or RNA or from natural DNA or RNA from any source including bacteria, yeast, viruses and higher organisms such as plants or animals. DNA or RNA may be extracted from blood or tissue material. The template polynucleotide may be obtained by amplification using the polymerase chain reaction (PCR, see U.S. Pat. No. 4,683,202 and U.S. Pat. No. 4,683,195). Alternatively, the polynucleotide may be present in a vector present in a cell and sufficient nucleic acid may be obtained by culturing the cell and extracting the nucleic acid from the cell by methods known in the art.

Any specific nucleic acid sequence can be used to produce the population of hybrids by the present process. It is only necessary that a small population of hybrid sequences of the specific nucleic acid sequence exist or be created prior to the present process.

The initial small population of the specific nucleic acid sequences having mutations may be created by a number of different methods. Mutations may be created by error-prone PCR. Error-prone PCR uses low-fidelity polymerization conditions to introduce a low level of point mutations randomly over a long sequence. Alternatively, mutations can be introduced into the template polynucleotide by oligonucleotide-directed mutagenesis. In oligonucleotide-directed mutagenesis, a short sequence of the polynucleotide is removed from the polynucleotide using restriction enzyme digestion and is replaced with a synthetic polynucleotide in which various bases have been altered from the original sequence. The polynucleotide sequence can also be altered by chemical mutagenesis. Chemical mutagens include, for example, sodium bisulfite, nitrous acid, hydroxylamine, hydrazine or formic acid. Other agents which are analogues of nucleotide precursors include nitrosoguanidine, 5-bromouracil, 2-aminopurine, or acridine. Generally, these agents are added to the PCR reaction in place of the nucleotide precursor, thereby mutating the sequence. Intercalating agents such as proflavine, acriflavine, quinacrine and the like can also be used. Random mutagenesis of the polynucleotide sequence can also be achieved by irradiation with X-rays or ultraviolet light. Generally, plasmid polynucleotides so mutagenized are introduced into E. coli and propagated as a pool or library of hybrid plasmids.

Alternatively, the small mixed population of specific nucleic acids may be found in nature in that they may consist of different alleles of the same gene or the same gene from different related species (i.e., cognate genes). Alternatively, they may be related DNA sequences found within one species, for example, the immunoglobulin genes.

Once the mixed population of the specific nucleic acid sequences is generated, the polynucleotides can be used directly or inserted into an appropriate cloning vector, using techniques well-known in the art.

The choice of vector depends on the size of the polynucleotide sequence and the host cell to be employed in the methods of this invention. The templates of this invention may be plasmids, phages, cosmids, phagemids, viruses (e.g., retroviruses, parainfluenzavirus, herpesviruses, reoviruses, paramyxoviruses, and the like), or selected portions thereof (e.g., coat protein, spike glycoprotein, capsid protein). For example, cosmids and phagemids are preferred where the specific nucleic acid sequence to be mutated is larger because these vectors are able to stably propagate large polynucleotides.

If the mixed population of the specific nucleic acid sequence is cloned into a vector it can be clonally amplified by inserting each vector into a host cell and allowing the host cell to amplify the vector. This is referred to as clonal amplification because while the absolute number of nucleic acid sequences increases, the number of hybrids does not increase. Utility can be readily determined by screening expressed polypeptides.

The DNA shuffling method of this invention can be performed blindly on a pool of unknown sequences. By adding to the reassembly mixture oligonucleotides (with ends that are homologous to the sequences being reassembled) any sequence mixture can be incorporated at any specific position into another sequence mixture. Thus, it is contemplated that mixtures of synthetic oligonucleotides, PCR polynucleotides or even whole genes can be mixed into another sequence library at defined positions. The insertion of one sequence (mixture) is independent from the insertion of a sequence in another part of the template. Thus, the degree of recombination, the homology required, and the diversity of the library can be independently and simultaneously varied along the length of the reassembled DNA.

This approach of mixing two genes may be useful for the humanization of antibodies from murine hybridomas. The approach of mixing two genes or inserting alternative sequences into genes may be useful for any therapeutically used protein, for example, interleukin 1, antibodies, tPA and growth hormone. The approach may also be useful in any nucleic acid, for example, promoters or introns or 3′ untranslated regions or 5′ untranslated regions of genes, to increase expression or alter specificity of expression of proteins. The approach may also be used to mutate ribozymes or aptamers.

Shuffling requires the presence of homologous regions separating regions of diversity. Scaffold-like protein structures may be particularly suitable for shuffling. The conserved scaffold determines the overall folding by self-association, while displaying relatively unrestricted loops that mediate the specific binding. Examples of such scaffolds are the immunoglobulin beta-barrel, and the four-helix bundle, which are well-known in the art. This shuffling can be used to create scaffold-like proteins with various combinations of mutated sequences for binding.

3.2.1.1. In Vitro Shuffling

The equivalents of some standard genetic matings may also be performed by shuffling in vitro. For example, a “molecular backcross” can be performed by repeatedly mixing the hybrid's nucleic acid with the wild-type nucleic acid while selecting for the mutations of interest. As in traditional breeding, this approach can be used to combine phenotypes from different sources into a background of choice. It is useful, for example, for the removal of neutral mutations that affect unselected characteristics (e.g. immunogenicity). Thus it can be useful to determine which mutations in a protein are involved in the enhanced biological activity and which are not, an advantage which cannot be achieved by error-prone mutagenesis or cassette mutagenesis methods.

Large, functional genes can be assembled correctly from a mixture of small random polynucleotides. This reaction may be of use for the reassembly of genes from the highly fragmented DNA of fossils. In addition random nucleic acid fragments from fossils may be combined with polynucleotides from similar genes from related species.

It is also contemplated that the method of this invention can be used for the in vitro amplification of a whole genome from a single cell as is needed for a variety of research and diagnostic applications. DNA amplification by PCR is in practice limited to a length of about 40 kb. Amplification of a whole genome such as that of E. coli (5,000 kb) by PCR would require about 250 primers yielding 125 polynucleotides of 40 kb each. This approach is not practical due to the unavailability of sufficient sequence data. On the other hand, random production of polynucleotides of the genome with sexual PCR cycles, followed by gel purification of small polynucleotides, will provide a multitude of possible primers. Use of this mix of random small polynucleotides as primers in a PCR reaction alone or with the whole genome as the template should result in an inverse chain reaction with the theoretical endpoint of a single concatamer containing many copies of the genome. A 100-fold amplification in the copy number and an average polynucleotide size of greater than 50 kb may be obtained when only random polynucleotides are used. It is thought that the larger concatamer is generated by overlap of many smaller polynucleotides. The quality of specific PCR products obtained using synthetic primers will be indistinguishable from the product obtained from unamplified DNA. It is expected that this approach will be useful for the mapping of genomes.
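The primer count quoted above follows from simple arithmetic, which may be sketched as below. The sketch assumes, for illustration only, that the genome is covered by non-overlapping amplicons of the maximum practical length and that each amplicon needs its own pair of primers; the function name `pcr_tiling` is hypothetical.

```python
import math

def pcr_tiling(genome_kb, max_amplicon_kb):
    """Return (fragments, primers) needed to tile a genome with
    non-overlapping amplicons, assuming two primers per amplicon."""
    fragments = math.ceil(genome_kb / max_amplicon_kb)
    primers = 2 * fragments
    return fragments, primers

fragments, primers = pcr_tiling(genome_kb=5000, max_amplicon_kb=40)
print(fragments, primers)  # 125 250
```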

The polynucleotide to be shuffled can be produced as random or non-random polynucleotides, at the discretion of the practitioner. Moreover, this invention provides a method of shuffling that is applicable to a wide range of polynucleotide sizes and types, including the step of generating polynucleotide monomers to be used as building blocks in the reassembly of a larger polynucleotide. For example, the building blocks can be fragments of genes or they can be comprised of entire genes or gene pathways, or any combination thereof.

3.2.1.2. In Vivo Shuffling

In an embodiment of in vivo shuffling, the mixed population of the specific nucleic acid sequence is introduced into bacterial or eukaryotic cells under conditions such that at least two different nucleic acid sequences are present in each host cell. The polynucleotides can be introduced into the host cells by a variety of different methods. The host cells can be transformed with the smaller polynucleotides using methods known in the art, for example treatment with calcium chloride. If the polynucleotides are inserted into a phage genome, the host cell can be transfected with the recombinant phage genome having the specific nucleic acid sequences. Alternatively, the nucleic acid sequences can be introduced into the host cell using electroporation, transfection, lipofection, biolistics, conjugation, and the like.

In general, in this embodiment, the specific nucleic acid sequences will be present in vectors which are capable of stably replicating the sequence in the host cell. In addition, it is contemplated that the vectors will encode a marker gene such that host cells having the vector can be selected. This ensures that the mutated specific nucleic acid sequence can be recovered after introduction into the host cell. However, it is contemplated that the entire mixed population of the specific nucleic acid sequences need not be present on a vector sequence. Rather only a sufficient number of sequences need be cloned into vectors to ensure that after introduction of the polynucleotides into the host cells each host cell contains one vector having at least one specific nucleic acid sequence present therein. It is also contemplated that rather than having a subset of the population of the specific nucleic acid sequences cloned into vectors, this subset may be already stably integrated into the host cell.

It has been found that when two polynucleotides which have regions of identity are inserted into the host cells, homologous recombination occurs between the two polynucleotides. Such recombination between the two mutated specific nucleic acid sequences will result in the production of double or triple hybrids in some situations.
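As a purely illustrative sketch of how recombination within a region of identity can combine mutations from two parents into a double hybrid, the following treats two equal-length sequences as strings and performs a single reciprocal crossover. The helper names and the `min_len` threshold are assumptions of the sketch, and a single crossover at a chosen point is a simplification of recombination in a host cell.

```python
def identity_runs(a, b, min_len):
    """Return (start, end) spans where `a` and `b` are identical over
    at least `min_len` consecutive positions (end is exclusive)."""
    runs, start = [], None
    length = min(len(a), len(b))
    for i in range(length):
        if a[i] == b[i]:
            if start is None:
                start = i
        else:
            if start is not None and i - start >= min_len:
                runs.append((start, i))
            start = None
    if start is not None and length - start >= min_len:
        runs.append((start, length))
    return runs

def crossover(parent_a, parent_b, point):
    """Single reciprocal crossover at `point`; returns both daughters."""
    return (parent_a[:point] + parent_b[point:],
            parent_b[:point] + parent_a[point:])

# Two hypothetical parents, each carrying one point mutation.
parent_a = "AAGAAAAAAAAA"   # mutation at position 2
parent_b = "AAAAAAAAACAA"   # mutation at position 9
start, end = identity_runs(parent_a, parent_b, min_len=4)[0]
double_hybrid, wild_type = crossover(parent_a, parent_b, (start + end) // 2)
print(double_hybrid)  # AAGAAAAAACAA
print(wild_type)      # AAAAAAAAAAAA
```

In this toy case one daughter carries both parental mutations while the reciprocal daughter carries neither.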

It has also been found that the frequency of recombination is increased if some of the mutated specific nucleic acid sequences are present on linear nucleic acid molecules. Therefore, in a preferred embodiment, some of the specific nucleic acid sequences are present on linear polynucleotides.

After transformation, the host cell transformants are placed under selection to identify those host cell transformants which contain mutated specific nucleic acid sequences having the qualities desired. For example, if increased resistance to a particular drug is desired then the transformed host cells may be subjected to increased concentrations of the particular drug and those transformants producing mutated proteins able to confer increased drug resistance will be selected. If the enhanced ability of a particular protein to bind to a receptor is desired, then expression of the protein can be induced from the transformants and the resulting protein assayed in a ligand binding assay by methods known in the art to identify that subset of the mutated population which shows enhanced binding to the ligand. Alternatively, the protein can be expressed in another system to ensure proper processing.

Once a subset of the first recombined specific nucleic acid sequences (daughter sequences) having the desired characteristics is identified, the sequences of that subset are then subjected to a second round of recombination.

In the second cycle of recombination, the recombined specific nucleic acid sequences may be mixed with the original mutated specific nucleic acid sequences (parent sequences) and the cycle repeated as described above. In this way a set of second recombined specific nucleic acid sequences can be identified which have enhanced characteristics or encode proteins having enhanced properties. This cycle can be repeated a number of times as desired.

It is also contemplated that in the second or subsequent recombination cycle, a backcross can be performed. A molecular backcross can be performed by mixing the desired specific nucleic acid sequences with a large number of the wild-type sequence, such that at least one wild-type nucleic acid sequence and a mutated nucleic acid sequence are present in the same host cell after transformation. Recombination with the wild-type specific nucleic acid sequence will eliminate those neutral mutations that may affect unselected characteristics such as immunogenicity but not the selected characteristics.

In another embodiment of this invention, it is contemplated that during the first round a subset of the specific nucleic acid sequences can be generated as smaller polynucleotides by slowing or halting their PCR amplification prior to introduction into the host cell. The size of the polynucleotides must be large enough to contain some regions of identity with the other sequences so as to homologously recombine with the other sequences. The size of the polynucleotides will range from 0.03 kb to 100 kb, more preferably from 0.2 kb to 10 kb. It is also contemplated that in subsequent rounds, all of the specific nucleic acid sequences other than the sequences selected from the previous round may be utilized to generate PCR polynucleotides prior to introduction into the host cells.

The shorter polynucleotide sequences can be single-stranded or double-stranded. If the sequences were originally single-stranded and have become double-stranded they can be denatured with heat, chemicals or enzymes prior to insertion into the host cell. The reaction conditions suitable for separating the strands of nucleic acid are well known in the art.

The steps of this process can be repeated indefinitely, being limited only by the number of possible hybrids which can be achieved. After a certain number of cycles, all possible hybrids will have been achieved and further cycles are redundant.

In an embodiment the same mutated template nucleic acid is repeatedly recombined and the resulting recombinants selected for the desired characteristic.

Therefore, the initial pool or population of mutated template nucleic acid is cloned into a vector capable of replicating in a bacterium such as E. coli. The particular vector is not essential, so long as it is capable of autonomous replication in E. coli. In a preferred embodiment, the vector is designed to allow the expression and production of any protein encoded by the mutated specific nucleic acid linked to the vector. It is also preferred that the vector contain a gene encoding a selectable marker.

The population of vectors containing the pool of mutated nucleic acid sequences is introduced into the E. coli host cells. The vector nucleic acid sequences may be introduced by transformation, transfection or infection in the case of phage. The concentration of vectors used to transform the bacteria is such that a number of vectors is introduced into each cell. Once present in the cell, the efficiency of homologous recombination is such that homologous recombination occurs between the various vectors. This results in the generation of hybrids (daughters) having a combination of mutations which differ from the original parent mutated sequences.

The host cells are then clonally replicated and selected for the marker gene present on the vector. Only those cells having a plasmid will grow under the selection.

The host cells which contain a vector are then tested for the presence of favorable mutations. Such testing may consist of placing the cells under selective pressure, for example, if the gene to be selected is an improved drug resistance gene. If the vector allows expression of the protein encoded by the mutated nucleic acid sequence, then such selection may include allowing expression of the protein so encoded, isolation of the protein and testing of the protein to determine whether, for example, it binds with increased efficiency to the ligand of interest.

Once a particular daughter mutated nucleic acid sequence has been identified which confers the desired characteristics, the nucleic acid is isolated either already linked to the vector or separated from the vector. This nucleic acid is then mixed with the first or parent population of nucleic acids and the cycle is repeated.

It has been shown that by this method nucleic acid sequences having enhanced desired properties can be selected.

In an alternate embodiment, the first generation of hybrids is retained in the cells and the parental mutated sequences are added again to the cells. Accordingly, the first cycle of Embodiment 1 is conducted as described above. However, after the daughter nucleic acid sequences are identified, the host cells containing these sequences are retained.

The parent mutated specific nucleic acid population, either as polynucleotides or cloned into the same vector, is introduced into the host cells already containing the daughter nucleic acids. Recombination is allowed to occur in the cells and the next generation of recombinants, or granddaughters, are selected by the methods described above.

This cycle can be repeated a number of times until the nucleic acid or peptide having the desired characteristics is obtained. It is contemplated that in subsequent cycles, the population of mutated sequences which are added to the preferred hybrids may come from the parental hybrids or any subsequent generation.

In an alternative embodiment, the invention provides a method of conducting a “molecular” backcross of the obtained recombinant specific nucleic acid in order to eliminate any neutral mutations. Neutral mutations are those mutations which do not confer onto the nucleic acid or peptide the desired properties. Such mutations may however confer on the nucleic acid or peptide undesirable characteristics. Accordingly, it is desirable to eliminate such neutral mutations. The method of this invention provides a means of doing so.

In this embodiment, after the hybrid nucleic acid, having the desired characteristics, is obtained by the methods of the embodiments, the nucleic acid, the vector having the nucleic acid or the host cell containing the vector and nucleic acid is isolated.

The nucleic acid or vector is then introduced into the host cell with a large excess of the wild-type nucleic acid. The nucleic acid of the hybrid and the nucleic acid of the wild-type sequence are allowed to recombine. The resulting recombinants are placed under the same selection as the hybrid nucleic acid. Only those recombinants which retained the desired characteristics will be selected. Any silent mutations which do not provide the desired characteristics will be lost through recombination with the wild-type DNA. This cycle can be repeated a number of times until all of the silent mutations are eliminated.
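The repeated backcross cycle may be illustrated with a deliberately simplified model in which a hybrid is represented only by the positions of its mutations. The model assumes that each recombinant replaces one contiguous window of the hybrid with wild-type sequence, that selection retains only recombinants still carrying every mutation in a hypothetical `required` set, and that the retained recombinant with the fewest mutations is carried into the next cycle. None of these modelling choices is taken from the disclosure.

```python
def recombinants_with_wild_type(mutations, length):
    """Yield the mutation sets left after one window [i, j) of the
    hybrid has been replaced by wild-type sequence."""
    for i in range(length):
        for j in range(i + 1, length + 1):
            yield frozenset(m for m in mutations if not (i <= m < j))

def backcross(mutations, required, length, max_cycles=10):
    """Repeat recombination with wild type plus selection until no
    further mutations can be removed without losing a required one."""
    current = frozenset(mutations)
    required = frozenset(required)
    for _ in range(max_cycles):
        # Selection: keep recombinants that retain every required mutation.
        survivors = [r for r in recombinants_with_wild_type(current, length)
                     if required <= r]
        best = min(survivors, key=len, default=current)
        if len(best) >= len(current):
            break  # nothing more to eliminate
        current = best
    return current

# Hypothetical hybrid: mutations 5 and 14 confer the selected trait,
# while 2, 9 and 17 are neutral passengers.
print(sorted(backcross({2, 5, 9, 14, 17}, required={5, 14}, length=20)))  # [5, 14]
```

In this example three cycles are needed, because each wild-type window can remove only the neutral mutations lying between two required ones.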

Thus the methods of this invention can be used in a molecular backcross to eliminate unnecessary or silent mutations.

3.2.2. Exonuclease-Mediated Reassembly

In a particular embodiment, this invention provides for a method for shuffling, assembling, reassembling, recombining, &/or concatenating at least two polynucleotides to form a progeny polynucleotide (e.g. a chimeric progeny polynucleotide that can be expressed to produce a polypeptide or a gene pathway). In a particular embodiment, a double stranded polynucleotide end (e.g. two single stranded sequences hybridized to each other as hybridization partners) is treated with an exonuclease to liberate nucleotides from one of the two strands, leaving the remaining strand free of its original partner so that, if desired, the remaining strand may be used to achieve hybridization to another partner.

In a particular aspect, a double stranded polynucleotide end (that may be part of, or connected to, a polynucleotide or a nonpolynucleotide sequence) is subjected to a source of exonuclease activity. Serviceable sources of exonuclease activity may be an enzyme with 3′ exonuclease activity, an enzyme with 5′ exonuclease activity, an enzyme with both 3′ exonuclease activity and 5′ exonuclease activity, and any combination thereof. An exonuclease can be used to liberate nucleotides from one or both ends of a linear double stranded polynucleotide, and from one to all ends of a branched polynucleotide having more than two ends. The mechanism of action of this liberation is believed to be comprised of an enzymatically-catalyzed hydrolysis of terminal nucleotides, and can be allowed to proceed in a time-dependent fashion, allowing experimental control of the progression of the enzymatic process.

By contrast, a non-enzymatic step may be used to shuffle, assemble, reassemble, recombine, and/or concatenate polynucleotide building blocks; this step is comprised of subjecting a working sample to denaturing (or “melting”) conditions (for example, by changing temperature, pH, and/or salinity conditions) so as to melt a working set of double stranded polynucleotides into single polynucleotide strands. For shuffling, it is desirable that the single polynucleotide strands participate to some extent in annealment with different hybridization partners (i.e. and not merely revert to exclusive reannealment between what were former partners before the denaturation step). The presence of the former hybridization partners in the reaction vessel, however, does not preclude, and may sometimes even favor, reannealment of a single stranded polynucleotide with its former partner, to recreate an original double stranded polynucleotide.

In contrast to this non-enzymatic shuffling step comprised of subjecting double stranded polynucleotide building blocks to denaturation, followed by annealment, the instant invention further provides an exonuclease-based approach requiring no denaturation; rather, the avoidance of denaturing conditions and the maintenance of double stranded polynucleotide substrates in annealed (i.e. non-denatured) state are necessary conditions for the action of exonucleases (e.g., exonuclease III and red alpha gene product). Additionally in contrast, the generation of single stranded polynucleotide sequences capable of hybridizing to other single stranded polynucleotide sequences is the result of covalent cleavage (and hence sequence destruction) in one of the hybridization partners. For example, an exonuclease III enzyme may be used to enzymatically liberate 3′ terminal nucleotides in one hybridization strand (to achieve covalent hydrolysis in that polynucleotide strand); and this favors hybridization of the remaining single strand to a new partner (since its former partner was subjected to covalent cleavage).

By way of further illustration, a specific exonuclease, namely exonuclease III, is provided herein as an example of a 3′ exonuclease; however, other exonucleases may also be used, including enzymes with 5′ exonuclease activity and enzymes with 3′ exonuclease activity, and including enzymes not yet discovered and enzymes not yet developed. It is particularly appreciated that enzymes can be discovered, optimized (e.g. engineered by directed evolution), or both discovered and optimized specifically for the instantly disclosed approach that have more optimal rates &/or more highly specific activities &/or greater lack of unwanted activities. In fact it is expected that the instant invention may encourage the discovery &/or development of such designer enzymes. In sum, this invention may be practiced with a variety of currently available exonuclease enzymes, as well as enzymes not yet discovered and enzymes not yet developed.

The exonuclease action of exonuclease III requires a working double stranded polynucleotide end that is either blunt or has a 5′ overhang, and the exonuclease action is comprised of enzymatically liberating 3′ terminal nucleotides, leaving a single stranded 5′ end that becomes longer and longer as the exonuclease action proceeds (see FIG. 1). Any 5′ overhangs produced by this approach may be used to hybridize to another single stranded polynucleotide sequence (which may also be a single stranded polynucleotide or a terminal overhang of a partially double stranded polynucleotide) that shares enough homology to allow hybridization. The ability of these exonuclease III-generated single stranded sequences (e.g. in 5′ overhangs) to hybridize to other single stranded sequences allows two or more polynucleotides to be shuffled, assembled, reassembled, &/or concatenated.

Furthermore, it is appreciated that one can protect the end of a double stranded polynucleotide or render it susceptible to a desired enzymatic action of a serviceable exonuclease as necessary. For example, a double stranded polynucleotide end having a 3′ overhang is not susceptible to the exonuclease action of exonuclease III. However, it may be rendered susceptible to the exonuclease action of exonuclease III by a variety of means; for example, it may be blunted by treatment with a polymerase, cleaved to provide a blunt end or a 5′ overhang, joined (ligated or hybridized) to another double stranded polynucleotide to provide a blunt end or a 5′ overhang, hybridized to a single stranded polynucleotide to provide a blunt end or a 5′ overhang, or modified by any of a variety of means.
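The end-type rule stated above (blunt ends and 5′ overhangs are substrates for exonuclease III, 3′ overhangs are not) can be written as a small lookup. The `EndType` enumeration and the signed `three_prime_extension` convention are hypothetical devices adopted for this sketch only.

```python
from enum import Enum

class EndType(Enum):
    BLUNT = "blunt"
    FIVE_PRIME_OVERHANG = "5' overhang"
    THREE_PRIME_OVERHANG = "3' overhang"

def classify_end(three_prime_extension):
    """Classify a duplex end from how far the 3'-terminated strand
    extends past its partner: positive means it protrudes, negative
    means it is recessed (so the partner forms a 5' overhang)."""
    if three_prime_extension > 0:
        return EndType.THREE_PRIME_OVERHANG
    if three_prime_extension < 0:
        return EndType.FIVE_PRIME_OVERHANG
    return EndType.BLUNT

def exo3_susceptible(end):
    """Per the text, only blunt ends and 5' overhangs are substrates."""
    return end in (EndType.BLUNT, EndType.FIVE_PRIME_OVERHANG)

for extension in (-4, 0, 4):
    end = classify_end(extension)
    print(end.value, "->", "susceptible" if exo3_susceptible(end) else "protected")
```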

According to one aspect, an exonuclease may be allowed to act on one or on both ends of a linear double stranded polynucleotide and proceed to completion, to near completion, or to partial completion. When the exonuclease action is allowed to go to completion, the result will be that the length of each 5′ overhang will extend far towards the middle region of the polynucleotide in the direction of what might be considered a “rendezvous point” (which may be somewhere near the polynucleotide midpoint). Ultimately, this results in the production of single stranded polynucleotides (that can become dissociated) that are each about half the length of the original double stranded polynucleotide (see FIG. 1). Alternatively, an exonuclease-mediated reaction can be terminated before proceeding to completion.
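The progression toward the rendezvous point may be sketched for a blunt-ended linear duplex as follows. The sketch assumes, for simplicity, that both 3′ ends are digested by the same number of nucleotides and that digestion stops at the midpoint; the function names and the returned dictionary keys are hypothetical.

```python
_COMPLEMENT = str.maketrans("ACGT", "TGCA")

def reverse_complement(seq):
    """Return the partner strand, written 5'->3'."""
    return seq.translate(_COMPLEMENT)[::-1]

def exo3_digest(top, removed_per_end):
    """Model 3'->5' digestion of both strands of a blunt duplex.

    `top` is the top strand written 5'->3'. Each strand loses
    `removed_per_end` nucleotides from its 3' end, capped at the
    midpoint (the "rendezvous point")."""
    n = len(top)
    k = max(0, min(removed_per_end, n // 2))
    bottom = reverse_complement(top)
    return {
        "top": top[: n - k],                     # 3' end trimmed
        "bottom": bottom[: n - k],               # 3' end trimmed
        "five_prime_overhang_top": top[:k],      # now single stranded
        "five_prime_overhang_bottom": bottom[:k],
        "duplex_bp": n - 2 * k,                  # base pairs still annealed
    }

partial = exo3_digest("ATGGCAAC", 2)
print(partial["top"], partial["bottom"], partial["duplex_bp"])   # ATGGCA GTTGCC 4
complete = exo3_digest("ATGGCAAC", 100)
print(complete["top"], complete["bottom"], complete["duplex_bp"])  # ATGG GTTG 0
```

At completion the two remaining strands share no base pairs and are each half the original length, consistent with the description above.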

Thus this exonuclease-mediated approach is serviceable for shuffling, assembling &/or reassembling, recombining, and concatenating polynucleotide building blocks, which polynucleotide building blocks can be up to ten bases long or tens of bases long or hundreds of bases long or thousands of bases long or tens of thousands of bases long or hundreds of thousands of bases long or millions of bases long or even longer.

This exonuclease-mediated approach is based on the double stranded DNA specific exodeoxyribonuclease activity of E. coli exonuclease III. Substrates for exonuclease III may be generated by subjecting a double stranded polynucleotide to fragmentation. Fragmentation may be achieved by mechanical means (e.g., shearing, sonication, etc.), by enzymatic means (e.g. using restriction enzymes), and by any combination thereof. Fragments of a larger polynucleotide may also be generated by polymerase-mediated synthesis.

Exonuclease III is a 28K monomeric enzyme, product of the xthA gene of E. coli, with four known activities: exodeoxyribonuclease (alternatively referred to as exonuclease herein), RNaseH, DNA-3′-phosphatase, and AP endonuclease. The exodeoxyribonuclease activity is specific for double stranded DNA. The mechanism of action is thought to involve enzymatic hydrolysis of DNA from a 3′ end progressively towards a 5′ direction, with formation of nucleoside 5′-phosphates and a residual single strand. The enzyme does not display efficient hydrolysis of single stranded DNA, single-stranded RNA, or double-stranded RNA; however it degrades RNA in a DNA-RNA hybrid, releasing nucleoside 5′-phosphates. The enzyme also releases inorganic phosphate specifically from 3′-phosphomonoester groups on DNA, but not from RNA or short oligonucleotides. Removal of these groups converts the terminus into a primer for DNA polymerase action.

Additional examples of enzymes with exonuclease activity include red-alpha and venom phosphodiesterases. Red alpha gene product (also referred to as lambda exonuclease) is of bacteriophage origin. The red gene is transcribed from the leftward promoter and its product (24 kD) is involved in recombination. Red alpha gene product acts processively from 5′-phosphorylated termini to liberate mononucleotides from duplex DNA (Takahashi & Kobayashi, 1990). Venom phosphodiesterases (Laskowski, 1980) are capable of rapidly opening supercoiled DNA.

3.2.3. Non-Stochastic Ligation Reassembly

In one aspect, the present invention provides a non-stochastic method termed synthetic ligation reassembly (SLR), that is somewhat related to stochastic shuffling, save that the nucleic acid building blocks are not shuffled or concatenated or chimerized randomly, but rather are assembled non-stochastically.

A particularly glaring difference is that the instant SLR method does not depend on the presence of a high level of homology between polynucleotides to be shuffled. In contrast, prior methods, particularly prior stochastic shuffling methods, require the presence of a high level of homology, particularly at coupling sites, between polynucleotides to be shuffled. Accordingly these prior methods favor the regeneration of the original progenitor molecules, and are suboptimal for generating large numbers of novel progeny chimeras, particularly full-length progenies. The instant invention, on the other hand, can be used to non-stochastically generate libraries (or sets) of progeny molecules comprised of over 10¹⁰⁰ different chimeras. Conceivably, SLR can even be used to generate libraries comprised of over 10¹⁰⁰⁰ different progeny chimeras (with no upper limit in sight).

Thus, in one aspect, the present invention provides a method, which method is non-stochastic, of producing a set of finalized chimeric nucleic acid molecules having an overall assembly order that is chosen by design, which method is comprised of the steps of generating by design a plurality of specific nucleic acid building blocks having serviceable mutually compatible ligatable ends, and assembling these nucleic acid building blocks, such that a designed overall assembly order is achieved.

The mutually compatible ligatable ends of the nucleic acid building blocks to be assembled are considered to be “serviceable” for this type of ordered assembly if they enable the building blocks to be coupled in predetermined orders. Thus, in one aspect, the overall assembly order in which the nucleic acid building blocks can be coupled is specified by the design of the ligatable ends and, if more than one assembly step is to be used, then the overall assembly order in which the nucleic acid building blocks can be coupled is also specified by the sequential order of the assembly step(s). FIG. 4, Panel C illustrates an exemplary assembly process comprised of 2 sequential steps to achieve a designed (non-stochastic) overall assembly order for five nucleic acid building blocks. In a preferred embodiment of this invention, the annealed building pieces are treated with an enzyme, such as a ligase (e.g. T4 DNA ligase), to achieve covalent bonding of the building pieces.
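The way in which designed ends can dictate a single assembly order may be sketched as follows. In this simplified model each ligatable end carries a junction label, and a right end is taken to ligate only to a left end bearing the same label; the `Block` type, the labels, and the function name are hypothetical, and multi-step assembly as in FIG. 4, Panel C is not modelled.

```python
from typing import NamedTuple, Optional

class Block(NamedTuple):
    name: str
    left: Optional[str]   # junction label of the left end (None = terminus)
    right: Optional[str]  # junction label of the right end (None = terminus)

def assembly_order(blocks):
    """Return the unique order in which the blocks can be coupled,
    or raise ValueError if the end design is ambiguous or incomplete."""
    by_left = {}
    for block in blocks:
        if block.left in by_left:
            raise ValueError(f"ambiguous left end: {block.left!r}")
        by_left[block.left] = block
    if None not in by_left:
        raise ValueError("no block has a terminal left end")
    order = [by_left[None]]
    while order[-1].right is not None:
        nxt = by_left.get(order[-1].right)
        if nxt is None:
            raise ValueError(f"no partner for right end {order[-1].right!r}")
        if nxt in order:
            raise ValueError("end design forms a cycle")
        order.append(nxt)
    if len(order) != len(blocks):
        raise ValueError("some blocks cannot be incorporated")
    return [block.name for block in order]

# Five hypothetical building blocks supplied in arbitrary order.
blocks = [
    Block("C", "j2", "j3"),
    Block("A", None, "j1"),
    Block("E", "j4", None),
    Block("B", "j1", "j2"),
    Block("D", "j3", "j4"),
]
print(assembly_order(blocks))  # ['A', 'B', 'C', 'D', 'E']
```

Rejecting duplicated junction labels reflects the requirement that the ends enable coupling only in predetermined orders.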

In a preferred embodiment, the design of nucleic acid building blocks is obtained upon analysis of the sequences of a set of progenitor nucleic acid templates that serve as a basis for producing a progeny set of finalized chimeric nucleic acid molecules. These progenitor nucleic acid templates thus serve as a source of sequence information that aids in the design of the nucleic acid building blocks that are to be mutagenized, i.e. chimerized or shuffled.

In one exemplification, this invention provides for the chimerization of a family of related genes and their encoded family of related products. In a particular exemplification, the encoded products are enzymes. As a representative list of families of enzymes which may be mutagenized in accordance with the aspects of the present invention, there may be mentioned the following enzymes and their functions:

-   1. Lipase/Esterase
    -   a. Enantioselective hydrolysis of esters (lipids)/thioesters
        -   1) Resolution of racemic mixtures
        -   2) Synthesis of optically active acids or alcohols from meso-diesters
    -   b. Selective syntheses
        -   1) Regiospecific hydrolysis of carbohydrate esters
        -   2) Selective hydrolysis of cyclic secondary alcohols
    -   c. Synthesis of optically active esters, lactones, acids, alcohols
        -   1) Transesterification of activated/nonactivated esters
        -   2) Interesterification
        -   3) Optically active lactones from hydroxyesters
        -   4) Regio- and enantioselective ring opening of anhydrides
    -   d. Detergents
    -   e. Fat/Oil conversion
    -   f. Cheese ripening
-   2. Protease
    -   a. Ester/amide synthesis
    -   b. Peptide synthesis
    -   c. Resolution of racemic mixtures of amino acid esters
    -   d. Synthesis of non-natural amino acids
    -   e. Detergents/protein hydrolysis
-   3. Glycosidase/Glycosyl transferase
    -   a. Sugar/polymer synthesis
    -   b. Cleavage of glycosidic linkages to form mono-, di- and oligosaccharides
    -   c. Synthesis of complex oligosaccharides
    -   d. Glycoside synthesis using UDP-galactosyl transferase
    -   e. Transglycosylation of disaccharides, glycosyl fluorides, aryl galactosides
    -   f. Glycosyl transfer in oligosaccharide synthesis
    -   g. Diastereoselective cleavage of -glucosylsulfoxides
    -   h. Asymmetric glycosylations
    -   i. Food processing
    -   j. Paper processing
-   4. Phosphatase/Kinase
    -   a. Synthesis/hydrolysis of phosphate esters
        -   1) Regio-, enantioselective phosphorylation
        -   2) Introduction of phosphate esters
        -   3) Synthesize phospholipid precursors
        -   4) Controlled polynucleotide synthesis
    -   b. Activate biological molecule
    -   c. Selective phosphate bond formation without protecting groups
-   5. Mono/Dioxygenase
    -   a. Direct oxyfunctionalization of unactivated organic substrates
    -   b. Hydroxylation of alkane, aromatics, steroids
    -   c. Epoxidation of alkenes
    -   d. Enantioselective sulphoxidation
    -   e. Regio- and stereoselective Baeyer-Villiger oxidations
-   6. Haloperoxidase
    -   a. Oxidative addition of halide ion to nucleophilic sites
    -   b. Addition of hypohalous acids to olefinic bonds
    -   c. Ring cleavage of cyclopropanes
    -   d. Activated aromatic substrates converted to ortho and para derivatives
    -   e. 1,3-diketones converted to 2-halo-derivatives
    -   f. Heteroatom oxidation of sulfur and nitrogen containing substrates
    -   g. Oxidation of enol acetates, alkynes and activated aromatic rings
-   7. Lignin peroxidase/Diarylpropane peroxidase
    -   a. Oxidative cleavage of C-C bonds
    -   b. Oxidation of benzylic alcohols to aldehydes
    -   c. Hydroxylation of benzylic carbons
    -   d. Phenol dimerization
    -   e. Hydroxylation of double bonds to form diols
    -   f. Cleavage of lignin aldehydes
-   8. Epoxide hydrolase
    -   a. Synthesis of enantiomerically pure bioactive compounds
    -   b. Regio- and enantioselective hydrolysis of epoxide
    -   c. Aromatic and olefinic epoxidation by monooxygenases to form epoxides
    -   d. Resolution of racemic epoxides
    -   e. Hydrolysis of steroid epoxides
-   9. Nitrile hydratase/nitrilase
    -   a. Hydrolysis of aliphatic nitriles to carboxamides
    -   b. Hydrolysis of aromatic, heterocyclic, unsaturated aliphatic nitriles to corresponding acids
    -   c. Hydrolysis of acrylonitrile
    -   d. Production of aromatic and carboxamides, carboxylic acids (nicotinamide, picolinamide, isonicotinamide)
    -   e. Regioselective hydrolysis of acrylic dinitrile
    -   f. -amino acids from -hydroxynitriles
-   10. Transaminase
    -   a. Transfer of amino groups into oxo-acids
-   11. Amidase/Acylase
    -   a. Hydrolysis of amides, amidines, and other C-N bonds
    -   b. Non-natural amino acid resolution and synthesis

These exemplifications, while illustrating certain specific aspects of the invention, do not portray the limitations or circumscribe the scope of the disclosed invention.

Thus according to one aspect of this invention, the sequences of a plurality of progenitor nucleic acid templates are aligned in order to select one or more demarcation points, which demarcation points can be located at an area of homology, and are comprised of one or more nucleotides, and which demarcation points are shared by at least two of the progenitor templates. The demarcation points can be used to delineate the boundaries of nucleic acid building blocks to be generated. Thus, the demarcation points identified and selected in the progenitor molecules serve as potential chimerization points in the assembly of the progeny molecules.

Preferably a serviceable demarcation point is an area of homology (comprised of at least one homologous nucleotide base) shared by at least two progenitor templates. More preferably a serviceable demarcation point is an area of homology that is shared by at least half of the progenitor templates. More preferably still a serviceable demarcation point is an area of homology that is shared by at least two thirds of the progenitor templates. Even more preferably a serviceable demarcation point is an area of homology that is shared by at least three fourths of the progenitor templates. Even more preferably still a serviceable demarcation point is an area of homology that is shared by almost all of the progenitor templates. Even more preferably still a serviceable demarcation point is an area of homology that is shared by all of the progenitor templates.
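The graded preference above can be illustrated with a short sketch. It assumes that the progenitor templates have already been aligned to equal length (with `-` marking gaps) and treats a single alignment column as the smallest possible area of homology. The function name `demarcation_points` and the parameter `min_fraction` are illustrative assumptions and do not appear in this disclosure.

```python
from math import ceil


def demarcation_points(aligned, min_fraction=1.0):
    """Return the alignment columns at which one base is shared by at least
    `min_fraction` of the templates (and never by fewer than two of them)."""
    if not aligned or len({len(s) for s in aligned}) != 1:
        raise ValueError("templates must be pre-aligned to equal length")
    needed = max(2, ceil(min_fraction * len(aligned)))
    points = []
    for col in range(len(aligned[0])):
        counts = {}
        for seq in aligned:
            base = seq[col].upper()
            if base != "-":  # gaps never count as homology
                counts[base] = counts.get(base, 0) + 1
        if counts and max(counts.values()) >= needed:
            points.append(col)
    return points


# Hypothetical toy alignment of three progenitor templates.
templates = ["ATGCA", "ATGGA", "TTGCA"]
print(demarcation_points(templates, min_fraction=1.0))  # shared by all templates
print(demarcation_points(templates, min_fraction=0.5))  # shared by at least half
```

Raising `min_fraction` moves from the least preferred case (shared by two templates) toward the most preferred case (shared by all templates), at the cost of fewer candidate chimerization points.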

The process of designing nucleic acid building blocks and of designing the mutually compatible ligatable ends of the nucleic acid building blocks to be assembled is illustrated in FIGS. 6 and 7. As shown, the alignment of a set of progenitor templates reveals several naturally occurring demarcation points, and the identification of demarcation points shared by these templates helps to non-stochastically determine the building blocks to be generated and used for the generation of the progeny chimeric molecules.

In a preferred embodiment, this invention provides that the ligation reassembly process is performed exhaustively in order to generate an exhaustive library. In other words, all possible ordered combinations of the nucleic acid building blocks are represented in the set of finalized chimeric nucleic acid molecules. At the same time, in a particularly preferred embodiment, the assembly order (i.e. the order of assembly of each building block in the 5′ to 3′ sequence of each finalized chimeric nucleic acid) in each combination is by design (or non-stochastic). Because of the non-stochastic nature of this invention, the possibility of unwanted side products is greatly reduced.
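As a minimal sketch of what "all possible ordered combinations" means computationally, the example below assumes that each position in the designed 5′ to 3′ order has a list of alternative building blocks, one of which is used per progeny molecule. The sequences and the name `exhaustive_library` are hypothetical and serve only to illustrate the counting.

```python
from itertools import product
from math import prod


def exhaustive_library(positions):
    """Enumerate every progeny sequence obtainable by taking one building
    block per position, keeping the designed 5' to 3' order of positions."""
    return ["".join(combo) for combo in product(*positions)]


# Hypothetical alternatives for three ordered positions.
positions = [["ATG", "ATC"], ["GGA", "GGT", "GGC"], ["TAA"]]
library = exhaustive_library(positions)

# The library size is the product of the number of alternatives per position.
print(len(library), prod(len(p) for p in positions))
```

Because the size is a product over positions, it grows very quickly as positions and alternatives are added, which is why very large libraries are attainable with a modest number of building blocks.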

In another preferred embodiment, this invention provides that the ligation reassembly process is performed systematically, for example in order to generate a systematically compartmentalized library, with compartments that can be screened systematically, e.g. one by one. In other words this invention provides that, through the selective and judicious use of specific nucleic acid building blocks, coupled with the selective and judicious use of sequentially stepped assembly reactions, an experimental design can be achieved where specific sets of progeny products are made in each of several reaction vessels. This allows a systematic examination and screening procedure to be performed. Thus, it allows a potentially very large number of progeny molecules to be examined systematically in smaller groups.

Because of its ability to perform chimerizations in a manner that is highly flexible yet exhaustive and systematic as well, particularly when there is a low level of homology among the progenitor molecules, the instant invention provides for the generation of a library (or set) comprised of a large number of progeny molecules. Because of the non-stochastic nature of the instant ligation reassembly invention, the progeny molecules generated preferably comprise a library of finalized chimeric nucleic acid molecules having an overall assembly order that is chosen by design. In a particularly preferred embodiment of this invention, such a generated library is comprised of preferably greater than 10³ different progeny molecular species, more preferably greater than 10⁵ different progeny molecular species, more preferably still greater than 10¹⁰ different progeny molecular species, more preferably still greater than 10¹⁵ different progeny molecular species, more preferably still greater than 10²⁰ different progeny molecular species, more preferably still greater than 10³⁰ different progeny molecular species, more preferably still greater than 10⁴⁰ different progeny molecular species, more preferably still greater than 10⁵⁰ different progeny molecular species, more preferably still greater than 10⁶⁰ different progeny molecular species, more preferably still greater than 10⁷⁰ different progeny molecular species, more preferably still greater than 10⁸⁰ different progeny molecular species, more preferably still greater than 10¹⁰⁰ different progeny molecular species, more preferably still greater than 10¹¹⁰ different progeny molecular species, more preferably still greater than 10¹²⁰ different progeny molecular species, more preferably still greater than 10¹³⁰ different progeny molecular species, more preferably still greater than 10¹⁴⁰ different progeny molecular species, more preferably still greater than 10¹⁵⁰ different progeny molecular species, more preferably still greater than 10¹⁷⁵ different progeny molecular species, more preferably still greater than 10²⁰⁰ different progeny molecular species, more preferably still greater than 10³⁰⁰ different progeny molecular species, more preferably still greater than 10⁴⁰⁰ different progeny molecular species, more preferably still greater than 10⁵⁰⁰ different progeny molecular species, and even more preferably still greater than 10¹⁰⁰⁰ different progeny molecular species.

In one aspect, a set of finalized chimeric nucleic acid molecules, produced as described, is comprised of a polynucleotide encoding a polypeptide. According to one preferred embodiment, this polynucleotide is a gene, which may be a man-made gene. According to another preferred embodiment, this polynucleotide is a gene pathway, which may be a man-made gene pathway. This invention provides that one or more man-made genes generated by this invention may be incorporated into a man-made gene pathway, such as a pathway operable in a eukaryotic organism (including a plant).

It is appreciated that the power of this invention is exceptional, as there is much freedom of choice and control regarding the selection of demarcation points, the size and number of the nucleic acid building blocks, and the size and design of the couplings. It is appreciated, furthermore, that the requirement for intermolecular homology is highly relaxed for the operability of this invention. In fact, demarcation points can even be chosen in areas of little or no intermolecular homology. For example, because of codon wobble, i.e. the degeneracy of codons, nucleotide substitutions can be introduced into nucleic acid building blocks without altering the amino acid originally encoded in the corresponding progenitor template. Alternatively, a codon can be altered such that the coding for an originally encoded amino acid is altered. This invention provides that such substitutions can be introduced into the nucleic acid building block in order to increase the incidence of intermolecularly homologous demarcation points and thus to allow an increased number of couplings to be achieved among the building blocks, which in turn allows a greater number of progeny chimeric molecules to be generated.

In another exemplification, the synthetic nature of the step in which the building blocks are generated allows the design and introduction of nucleotides (e.g. one or more nucleotides, which may be, for example, codons or introns or regulatory sequences) that can later be optionally removed in an in vitro process (e.g. by mutagenesis) or in an in vivo process (e.g. by utilizing the gene splicing ability of a host organism). It is appreciated that in many instances the introduction of these nucleotides may also be desirable for many other reasons in addition to the potential benefit of creating a serviceable demarcation point.

Thus, according to another embodiment, this invention provides that a nucleic acid building block can be used to introduce an intron. Thus, this invention provides that functional introns may be introduced into a man-made gene of this invention. This invention also provides that functional introns may be introduced into a man-made gene pathway of this invention. Accordingly, this invention provides for the generation of a chimeric polynucleotide that is a man-made gene containing one (or more) artificially introduced intron(s).

Accordingly, this invention also provides for the generation of a chimeric polynucleotide that is a man-made gene pathway containing one (or more) artificially introduced intron(s). Preferably, the artificially introduced intron(s) are functional in one or more host cells for gene splicing much in the way that naturally-occurring introns serve functionally in gene splicing. This invention provides a process of producing man-made intron-containing polynucleotides to be introduced into host organisms for recombination and/or splicing.

The ability to achieve chimerizations, using couplings as described herein, in areas of little or no homology among the progenitor molecules, is particularly useful, and in fact critical, for the assembly of novel gene pathways. This invention thus provides for the generation of novel man-made gene pathways using synthetic ligation reassembly. In a particular aspect, this is achieved by the introduction of regulatory sequences, such as promoters, that are operable in an intended host, to confer operability to a novel gene pathway when it is introduced into the intended host. In a particular exemplification, this invention provides for the generation of a novel man-made gene pathway that is operable in a plurality of intended hosts (e.g. in a microbial organism as well as in a plant cell). This can be achieved, for example, by the introduction of a plurality of regulatory sequences, comprised of a regulatory sequence that is operable in a first intended host and a regulatory sequence that is operable in a second intended host. A similar process can be performed to achieve operability of a gene pathway in a third intended host species, etc. The number of intended host species can be each integer from 1 to 10 or alternatively over 10. Alternatively, for example, operability of a gene pathway in a plurality of intended hosts can be achieved by the introduction of a regulatory sequence having intrinsic operability in a plurality of intended hosts.

Thus, according to a particular embodiment, this invention provides that a nucleic acid building block can be used to introduce a regulatory sequence, particularly a regulatory sequence for gene expression. Preferred regulatory sequences include, but are not limited to, those that are man-made, and those found in archaeal, bacterial, eukaryotic (including mitochondrial), viral, and prionic or prion-like organisms. Preferred regulatory sequences include, but are not limited to, promoters, operators, and activator binding sites. Thus, this invention provides that functional regulatory sequences may be introduced into a man-made gene of this invention. This invention also provides that functional regulatory sequences may be introduced into a man-made gene pathway of this invention.

Accordingly, this invention provides for the generation of a chimeric polynucleotide that is a man-made gene containing one (or more) artificially introduced regulatory sequence(s). Accordingly, this invention also provides for the generation of a chimeric polynucleotide that is a man-made gene pathway containing one (or more) artificially introduced regulatory sequence(s). Preferably, the artificially introduced regulatory sequence(s) are operatively linked to one or more genes in the man-made polynucleotide, and are functional in one or more host cells.

Preferred bacterial promoters that are serviceable for this invention include lac, lacZ, T3, T7, gpt, lambda P_(R), P_(L) and trp. Serviceable eukaryotic promoters include CMV immediate early, HSV thymidine kinase, early and late SV40, LTRs from retrovirus, and mouse metallothionein-1. Particular plant regulatory sequences include promoters active in directing transcription in plants, either constitutively or stage and/or tissue specific, depending on the use of the plant or parts thereof. These promoters include, but are not limited to, promoters showing constitutive expression, such as the 35S promoter of Cauliflower Mosaic Virus (CaMV) (Guilley et al., 1982), those for leaf-specific expression, such as the promoter of the ribulose bisphosphate carboxylase small subunit gene (Coruzzi et al., 1984), those for root-specific expression, such as the promoter from the glutamine synthetase gene (Tingey et al., 1987), those for seed-specific expression, such as the cruciferin A promoter from Brassica napus (Ryan et al., 1989), those for tuber-specific expression, such as the class-I patatin promoter from potato (Rocha-Sosa et al., 1989; Wenzler et al., 1989) or those for fruit-specific expression, such as the polygalacturonase (PG) promoter from tomato (Bird et al., 1988).

Other regulatory sequences that are preferred for this invention include terminator sequences and polyadenylation signals and any such sequence functioning as such in plants, the choice of which is within the level of the skilled artisan. An example of such sequences is the 3′ flanking region of the nopaline synthase (nos) gene of Agrobacterium tumefaciens (Bevan, 1984). The regulatory sequences may also include enhancer sequences, such as found in the 35S promoter of CaMV, and mRNA stabilizing sequences such as the leader sequence of Alfalfa Mosaic Virus (AlMV) RNA4 (Brederode et al., 1980) or any other sequences functioning in a like manner.

Man-made genes produced using this invention can also serve as a substrate for recombination with another nucleic acid. Likewise, a man-made gene pathway produced using this invention can also serve as a substrate for recombination with another nucleic acid. In a preferred instance, the recombination is facilitated by, or occurs at, areas of homology between the man-made intron-containing gene and a nucleic acid which serves as a recombination partner. In a particularly preferred instance, the recombination partner may also be a nucleic acid generated by this invention, including a man-made gene or a man-made gene pathway. Recombination may be facilitated by or may occur at areas of homology that exist at the one (or more) artificially introduced intron(s) in the man-made gene.

The synthetic ligation reassembly method of this invention utilizes a plurality of nucleic acid building blocks, each of which preferably has two ligatable ends. The two ligatable ends on each nucleic acid building block may be two blunt ends (i.e. each having an overhang of zero nucleotides), or preferably one blunt end and one overhang, or more preferably still two overhangs.

A serviceable overhang for this purpose may be a 3′ overhang or a 5′ overhang. Thus, a nucleic acid building block may have a 3′ overhang or alternatively a 5′ overhang or alternatively two 3′ overhangs or alternatively two 5′ overhangs. The overall order in which the nucleic acid building blocks are assembled to form a finalized chimeric nucleic acid molecule is determined by purposeful experimental design and is not random.

According to one preferred embodiment, a nucleic acid building block is generated by chemical synthesis of two single-stranded nucleic acids (also referred to as single-stranded oligos) and contacting them so as to allow them to anneal to form a double-stranded nucleic acid building block.

A double-stranded nucleic acid building block can be of variable size. The sizes of these building blocks can be small or large depending on the choice of the experimenter. Preferred sizes for building blocks range from 1 base pair (not including any overhangs) to 100,000 base pairs (not including any overhangs). Other preferred size ranges are also provided, which have lower limits of from 1 bp to 10,000 bp (including every integer value in between), and upper limits of from 2 bp to 100,000 bp (including every integer value in between).

It is appreciated that current methods of polymerase-based amplification can be used to generate double-stranded nucleic acids of up to thousands of base pairs, if not tens of thousands of base pairs, in length with high fidelity. Chemical synthesis (e.g. phosphoramidite-based) can be used to generate nucleic acids of up to hundreds of nucleotides in length with high fidelity; however, these can be assembled, e.g. using overhangs or sticky ends, to form double-stranded nucleic acids of up to thousands of base pairs, if not tens of thousands of base pairs, in length if so desired.

A combination of methods (e.g. phosphoramidite-based chemical synthesis and PCR) can also be used according to this invention. Thus, nucleic acid building blocks made by different methods can also be used in combination to generate a progeny molecule of this invention.

The use of chemical synthesis to generate nucleic acid building blocks is particularly preferred in this invention and is advantageous for other reasons as well, including procedural safety and ease. No cloning or harvesting or actual handling of any biological samples is required. The design of the nucleic acid building blocks can be accomplished on paper. Accordingly, this invention teaches an advance in procedural safety in recombinant technologies.

Nonetheless, according to one preferred embodiment, a double-stranded nucleic acid building block according to this invention may also be generated by polymerase-based amplification of a polynucleotide template. In a non-limiting exemplification, as illustrated in FIG. 2, a first polymerase-based amplification reaction using a first set of primers, F₂ and R₁, is used to generate a blunt-ended product (labeled Reaction 1, Product 1), which is essentially identical to Product A. A second polymerase-based amplification reaction using a second set of primers, F₁ and R₂, is used to generate a blunt-ended product (labeled Reaction 2, Product 2), which is essentially identical to Product B. These two products are mixed and allowed to melt and anneal, generating potentially useful double-stranded nucleic acid building blocks with two overhangs. In the example of FIG. 2, the product with the 3′ overhangs (Product C) is selected by nuclease-based degradation of the other 3 products using a 3′ acting exonuclease, such as exonuclease III. It is appreciated that a 5′ acting exonuclease (e.g. red alpha) may also be used, for example to select Product D instead. It is also appreciated that other selection means can also be used, including hybridization-based means, and that these means can incorporate a further means, such as a magnetic bead-based means, to facilitate separation of the desired product.

Many other methods exist by which a double-stranded nucleic acid building block can be generated that is serviceable for this invention; and these are known in the art and can be readily performed by the skilled artisan.

According to a particularly preferred embodiment, a double-stranded nucleic acid building block that is serviceable for this invention is generated by first generating two single stranded nucleic acids and allowing them to anneal to form a double-stranded nucleic acid building block. The two strands of a double-stranded nucleic acid building block may be complementary at every nucleotide apart from any that form an overhang; thus containing no mismatches, apart from any overhang(s). According to another embodiment, the two strands of a double-stranded nucleic acid building block are complementary at fewer than every nucleotide apart from any that form an overhang. Thus, according to this embodiment, a double-stranded nucleic acid building block can be used to introduce codon degeneracy. Preferably the codon degeneracy is introduced using the site-saturation mutagenesis described herein, using one or more N,N,G/T cassettes or alternatively using one or more N,N,N cassettes.
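The difference between an N,N,G/T cassette and an N,N,N cassette can be made concrete by enumerating the codons each one represents. The sketch below assumes the standard genetic code; the helper names `cassette_codons` and `coverage` are illustrative and not taken from this disclosure.

```python
from itertools import product

# Standard genetic code, laid out with bases ordered T, C, A, G at each
# codon position ('*' marks a stop codon).
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def cassette_codons(third_position):
    """All codons with N at positions 1 and 2 and the given bases at position 3."""
    return ["".join(c) for c in product("ACGT", "ACGT", third_position)]


def coverage(codons):
    """Return (number of codons, amino acids encoded, stop codons included)."""
    encoded = {CODON_TABLE[c] for c in codons}
    stops = sum(CODON_TABLE[c] == "*" for c in codons)
    return len(codons), len(encoded - {"*"}), stops


print(coverage(cassette_codons("GT")))    # N,N,G/T -> (32, 20, 1)
print(coverage(cassette_codons("ACGT")))  # N,N,N   -> (64, 20, 3)
```

Under the standard code, the N,N,G/T cassette therefore reaches all 20 amino acids with half as many codons as N,N,N and with a single stop codon (TAG) instead of three.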

Contained within an exemplary experimental design for achieving an ordered assembly according to this invention are:

-   1) The design of specific nucleic acid building blocks.
-   2) The design of specific ligatable ends on each nucleic acid building block.
-   3) The design of a particular order of assembly of the nucleic acid building blocks.

An overhang may be a 3′ overhang or a 5′ overhang. An overhang may also have a terminal phosphate group or alternatively may be devoid of a terminal phosphate group (having, e.g., a hydroxyl group instead). An overhang may be comprised of any number of nucleotides. Preferably an overhang is comprised of 0 nucleotides (as in a blunt end) to 10,000 nucleotides. Thus, a wide range of overhang sizes may be serviceable. Accordingly, the lower limit may be each integer from 1-200 and the upper limit may be each integer from 2-10,000. According to a particular exemplification, an overhang may consist of anywhere from 1 nucleotide to 200 nucleotides (including every integer value in between).
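One way to see how designed ligatable ends can enforce a particular order of assembly is to check the overhangs for unintended pairings. The sketch below assumes that all overhangs have the same polarity (all 5′ or all 3′), that each is written 5′ to 3′, and that two overhangs can anneal only when one is the exact reverse complement of the other. The function names are hypothetical, and real ligation can be more permissive than this model.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def overhangs_compatible(a, b):
    """True if two same-polarity overhangs can anneal along their full length."""
    return len(a) > 0 and reverse_complement(a) == b.upper()


def junctions_are_unambiguous(junction_overhangs):
    """Given one overhang per designed junction (its partner being the reverse
    complement), return True only if no overhang is self-complementary and no
    overhang could pair at a junction other than its own."""
    seen = set()
    for overhang in (o.upper() for o in junction_overhangs):
        partner = reverse_complement(overhang)
        if overhang == partner or overhang in seen or partner in seen:
            return False
        seen.add(overhang)
    return True


print(overhangs_compatible("ACTG", "CAGT"))                 # designed pair
print(junctions_are_unambiguous(["ACTG", "GGTA", "TTCA"]))  # three distinct junctions
print(junctions_are_unambiguous(["AATT"]))                  # palindromic overhang
```

A palindromic overhang such as AATT is rejected here because a building block carrying it could ligate to a copy of itself, which would defeat an ordered, non-stochastic assembly.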

The final chimeric nucleic acid molecule may be generated by sequentially assembling 2 or more building blocks at a time until all the designated building blocks have been assembled. A working sample may optionally be subjected to a process for size selection or purification or other selection or enrichment process between the performance of two assembly steps. Alternatively, the final chimeric nucleic acid molecule may be generated by assembling all the designated building blocks at once in one step.

Utility

The in vivo recombination method of this invention can be performed blindly on a pool of unknown hybrids or alleles of a specific polynucleotide or sequence. In other words, it is not necessary to know the actual DNA or RNA sequence of the specific polynucleotide.

The approach of using recombination within a mixed population of genes can be useful for the generation of any useful proteins, for example, interleukin 1, antibodies, tPA and growth hormone. This approach may be used to generate proteins having altered specificity or activity. The approach may also be useful for the generation of hybrid nucleic acid sequences, for example, promoter regions, introns, exons, enhancer sequences, 3′ untranslated regions or 5′ untranslated regions of genes. Thus this approach may be used to generate genes having increased rates of expression. This approach may also be useful in the study of repetitive DNA sequences. Finally, this approach may be useful to mutate ribozymes or aptamers.

Scaffold-like regions separating regions of diversity in proteins may be particularly suitable for the methods of this invention. The conserved scaffold determines the overall folding by self-association, while displaying relatively unrestricted loops that mediate the specific binding. Examples of such scaffolds are the immunoglobulin beta barrel, and the four-helix bundle. The methods of this invention can be used to create scaffold-like proteins with various combinations of mutated sequences for binding.

The equivalents of some standard genetic matings may also be performed by the methods of this invention. For example, a “molecular” backcross can be performed by repeated mixing of the hybrid's nucleic acid with the wild-type nucleic acid while selecting for the mutations of interest. As in traditional breeding, this approach can be used to combine phenotypes from different sources into a background of choice. It is useful, for example, for the removal of neutral mutations that affect unselected characteristics (e.g. immunogenicity). Thus it can be useful to determine which mutations in a protein are involved in the enhanced biological activity and which are not.

3.2.4. End-Selection

This invention provides a method for selecting a subset of polynucleotides from a starting set of polynucleotides, which method is based on the ability to discriminate one or more selectable features (or selection markers) present anywhere in a working polynucleotide, so as to allow one to perform selection for (positive selection) &/or against (negative selection) each selectable polynucleotide. In a preferred aspect, a method is provided termed end-selection, which method is based on the use of a selection marker located in part or entirely in a terminal region of a selectable polynucleotide, and such a selection marker may be termed an “end-selection marker”.

End-selection may be based on detection of naturally occurring sequences or on detection of sequences introduced experimentally (including by any mutagenesis procedure mentioned herein and not mentioned herein) or on both, even within the same polynucleotide. An end-selection marker can be a structural selection marker or a functional selection marker or both a structural and a functional selection marker. An end-selection marker may be comprised of a polynucleotide sequence or of a polypeptide sequence or of any chemical structure or of any biological or biochemical tag, including markers that can be selected using methods based on the detection of radioactivity, of enzymatic activity, of fluorescence, of any optical feature, of a magnetic property (e.g. using magnetic beads), of immunoreactivity, and of hybridization.

End-selection may be applied in combination with any method serviceable for performing mutagenesis. Such mutagenesis methods include, but are not limited to, methods described herein (supra and infra). Such methods include, by way of non-limiting exemplification, any method that may be referred to herein or by others in the art by any of the following terms: “saturation mutagenesis”, “shuffling”, “recombination”, “re-assembly”, “error-prone PCR”, “assembly PCR”, “sexual PCR”, “crossover PCR”, “oligonucleotide primer-directed mutagenesis”, “recursive (&/or exponential) ensemble mutagenesis (see Arkin and Youvan, 1992)”, “cassette mutagenesis”, “in vivo mutagenesis”, and “in vitro mutagenesis”. Moreover, end-selection may be performed on molecules produced by any mutagenesis &/or amplification method (see, e.g., Arnold, 1993; Caldwell and Joyce, 1992; Stemmer, 1994), following which method it is desirable to select for (including to screen for the presence of) desirable progeny molecules.

In addition, end-selection may be applied to a polynucleotide apart from any mutagenesis method. In a preferred embodiment, end-selection, as provided herein, can be used in order to facilitate a cloning step, such as a step of ligation to another polynucleotide (including ligation to a vector). This invention thus provides for end-selection as a serviceable means to facilitate library construction, selection &/or enrichment for desirable polynucleotides, and cloning in general.

In a particularly preferred embodiment, end-selection can be based on (positive) selection for a polynucleotide; alternatively end-selection can be based on (negative) selection against a polynucleotide; and alternatively still, end-selection can be based on both (positive) selection for, and on (negative) selection against, a polynucleotide. End-selection, along with other methods of selection &/or screening, can be performed in an iterative fashion, with any combination of like or unlike selection &/or screening methods and serviceable mutagenesis methods, all of which can be performed in an iterative fashion and in any order, combination, and permutation.

It is also appreciated that, according to one embodiment of this invention, end-selection may also be used to select a polynucleotide that is at least in part: circular (e.g. a plasmid or any other circular vector or any other polynucleotide that is partly circular), &/or branched, &/or modified or substituted with any chemical group or moiety. In accord with this embodiment, a polynucleotide may be a circular molecule comprised of an intermediate or central region, which region is flanked on a 5′ side by a 5′ flanking region (which, for the purpose of end-selection, serves in like manner to a 5′ terminal region of a non-circular polynucleotide) and on a 3′ side by a 3′ flanking region (which, for the purpose of end-selection, serves in like manner to a 3′ terminal region of a non-circular polynucleotide). As used in this non-limiting exemplification, there may be sequence overlap between any two regions or even among all three regions.

In one non-limiting aspect of this invention, end-selection of a linear polynucleotide is performed using a general approach based on the presence of at least one end-selection marker located at or near a polynucleotide end or terminus (that can be either a 5′ end or a 3′ end). In one particular non-limiting exemplification, end-selection is based on selection for a specific sequence at or near a terminus such as, but not limited to, a sequence recognized by an enzyme that recognizes a polynucleotide sequence. An enzyme that recognizes and catalyzes a chemical modification of a polynucleotide is referred to herein as a polynucleotide-acting enzyme. In a preferred embodiment, serviceable polynucleotide-acting enzymes are exemplified non-exclusively by enzymes with polynucleotide-cleaving activity, enzymes with polynucleotide-methylating activity, enzymes with polynucleotide-ligating activity, and enzymes with a plurality of distinguishable enzymatic activities (including non-exclusively, e.g., both polynucleotide-cleaving activity and polynucleotide-ligating activity).

Relevant polynucleotide-acting enzymes thus also include any commercially available or non-commercially available polynucleotide endonucleases and their companion methylases including those catalogued at the website http://www.neb.com/rebase, and those mentioned in the following cited reference (Roberts and Macelis, 1996). Preferred polynucleotide endonucleases include, but are not limited to, type II restriction enzymes (including type IIS), and include enzymes that cleave both strands of a double stranded polynucleotide (e.g. Not I, which cleaves both strands at 5′ . . . GC/GGCCGC . . . 3′) and enzymes that cleave only one strand of a double stranded polynucleotide, i.e. enzymes that have polynucleotide-nicking activity (e.g. N. BstNB I, which cleaves only one strand at 5′ . . . GAGTCNNNN/N . . . 3′). Relevant polynucleotide-acting enzymes also include type III restriction enzymes.

It is appreciated that relevant polynucleotide-acting enzymes also include any enzymes that may be developed in the future, though currently unavailable, that are serviceable for generating a ligation compatible end, preferably a sticky end, in a polynucleotide.

In one preferred exemplification, a serviceable selection marker is a restriction site in a polynucleotide that allows a corresponding type II (or type IIS) restriction enzyme to cleave an end of the polynucleotide so as to provide a ligatable end (including a blunt end or alternatively a sticky end with at least a one base overhang) that is serviceable for a desirable ligation reaction without cleaving the polynucleotide internally in a manner that destroys a desired internal sequence in the polynucleotide. Thus it is provided that, among relevant restriction sites, those sites that do not occur internally (i.e. that do not occur apart from the termini) in a specific working polynucleotide are preferred when the use of a corresponding restriction enzyme(s) is not intended to cut the working polynucleotide internally. This allows one to perform restriction digestion reactions to completion or to near completion without incurring unwanted internal cleavage in a working polynucleotide.
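Whether a candidate restriction site occurs internally in a working polynucleotide of known sequence can be checked before it is chosen as an end-selection marker. The sketch below is illustrative only: it searches both strands for an exact recognition sequence (no degenerate bases) and treats a hypothetical `terminal_window` of bases at each end as the terminal regions. The example sequence is invented, and the Not I recognition sequence GCGGCCGC is the one given in this description.

```python
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Reverse complement of a DNA sequence written 5' to 3'."""
    return seq.upper().translate(COMPLEMENT)[::-1]


def site_positions(seq, site):
    """0-based start positions of every (possibly overlapping) occurrence."""
    seq, site = seq.upper(), site.upper()
    hits, start = [], seq.find(site)
    while start != -1:
        hits.append(start)
        start = seq.find(site, start + 1)
    return hits


def internal_sites(seq, site, terminal_window=30):
    """Start positions of sites, on either strand, that lie entirely outside
    the two terminal windows of the working polynucleotide."""
    both_strands = set(site_positions(seq, site))
    both_strands |= set(site_positions(seq, reverse_complement(site)))
    upper_bound = len(seq) - terminal_window
    return sorted(p for p in both_strands
                  if p >= terminal_window and p + len(site) <= upper_bound)


# Invented working polynucleotide: a Not I site at each terminus and a
# single internal EcoR I site (GAATTC).
working = "GCGGCCGC" + "A" * 40 + "GAATTC" + "A" * 40 + "GCGGCCGC"
print(internal_sites(working, "GCGGCCGC", terminal_window=10))  # no internal Not I site
print(internal_sites(working, "GAATTC", terminal_window=10))    # one internal EcoR I site
```

In this toy case Not I would be a serviceable end-selection enzyme, since digestion to completion cuts only at the termini, whereas EcoR I would also cleave the polynucleotide internally.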

According to a preferred aspect, it is thus preferable to use restriction sites that are not contained, or alternatively that are not expected to be contained, or alternatively that are unlikely to be contained (e.g. when sequence information regarding a working polynucleotide is incomplete) internally in a polynucleotide to be subjected to end-selection. In accordance with this aspect, it is appreciated that restriction sites that occur relatively infrequently are usually preferred over those that occur more frequently. On the other hand it is also appreciated that there are occasions where internal cleavage of a polynucleotide is desired, e.g. to achieve recombination or other mutagenic procedures along with end-selection.
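The preference for sites that do not occur apart from the termini can be illustrated with a minimal sketch. The function names, the toy insert sequence, and the `terminal_window` parameter are hypothetical and are not part of this disclosure; only the strand shown is searched, which is sufficient for palindromic sites such as the Not I and Sma I sites.

```python
def site_positions(seq, site):
    """Return every 0-based start position of `site` on the given strand of `seq`."""
    seq, site = seq.upper(), site.upper()
    positions = []
    start = seq.find(site)
    while start != -1:
        positions.append(start)
        start = seq.find(site, start + 1)  # step by 1 so overlapping hits are found
    return positions


def internal_site_positions(seq, site, terminal_window):
    """Return hits that lie entirely outside both terminal windows."""
    inner_start = terminal_window
    inner_end = len(seq) - terminal_window
    return [p for p in site_positions(seq, site)
            if p >= inner_start and p + len(site) <= inner_end]


# Hypothetical working polynucleotide: Not I sites at both termini, toy insert between.
NOT_I = "GCGGCCGC"
SMA_I = "CCCGGG"
working = NOT_I + "ATGAAACCCGGGTTTTAA" + NOT_I

not_i_internal = internal_site_positions(working, NOT_I, terminal_window=8)
sma_i_internal = internal_site_positions(working, SMA_I, terminal_window=8)
print(not_i_internal, sma_i_internal)
```

In this toy case the Not I site occurs only at the termini and would be a candidate end-selection marker, whereas the Sma I site occurs internally and would cut the working polynucleotide if digestion were run to completion.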

In accord with this invention, it is also appreciated that methods (e.g. mutagenesis methods) can be used to remove unwanted internal restriction sites. It is also appreciated that a partial digestion reaction (i.e. a digestion reaction that proceeds to partial completion) can be used to achieve digestion at a recognition site in a terminal region while sparing a susceptible restriction site that occurs internally in a polynucleotide and that is recognized by the same enzyme. In one aspect, partial digests are useful because it is appreciated that certain enzymes show preferential cleavage of the same recognition sequence depending on the location and environment in which the recognition sequence occurs. For example, it is appreciated that, while lambda DNA has 5 EcoR I sites, cleavage of the site nearest to the right terminus has been reported to occur 10 times faster than the sites in the middle of the molecule. Also, for example, it has been reported that, while Sac II has four sites on lambda DNA, the three clustered centrally in lambda are cleaved 50 times faster than the remaining site near the terminus (at nucleotide 40,386). Summarily, site preferences have been reported for various enzymes by many investigators (e.g., Thomas and Davis, 1975; Forsblum et al, 1976; Nath and Azzolina, 1981; Brown and Smith, 1977; Gingeras and Brooks, 1983; Krüger et al, 1988; Conrad and Topal, 1989; Oller et al, 1991; Topal, 1991; and Pein, 1991; to name but a few). It is appreciated that any empirical observations as well as any mechanistic understandings of site preferences by any serviceable polynucleotide-acting enzymes, whether currently available or to be procured in the future, may be serviceable in end-selection according to this invention.

It is also appreciated that protection methods can be used to selectively protect specified restriction sites (e.g. internal sites) against unwanted digestion by enzymes that would otherwise cut a working polynucleotide in response to the presence of those sites; and that such protection methods include modifications such as methylations and base substitutions (e.g. U instead of T) that inhibit an unwanted enzyme activity. It is appreciated that there are limited numbers of available restriction enzymes that are rare enough (e.g. having very long recognition sequences) to create large (e.g. megabase-long) restriction fragments, and that protection approaches (e.g. by methylation) are serviceable for increasing the rarity of enzyme cleavage sites. The use of M.Fnu II (mCGCG) to increase the apparent rarity of Not I approximately twofold is but one example among many (Qiang et al, 1990; Nelson et al, 1984; Maxam and Gilbert, 1980; Raleigh and Wilson, 1986).

According to a preferred aspect of this invention, it is provided that, in general, the use of rare restriction sites is preferred. It is appreciated that, in general, the frequency of occurrence of a restriction site is determined by the number of nucleotides contained therein, as well as by the ambiguity of the base requirements contained therein. Thus, in a non-limiting exemplification, it is appreciated that, in general, a restriction site composed of, for example, 8 specific nucleotides (e.g. the Not I site or GC/GGCCGC, with an estimated relative occurrence of 1 in 4⁸, i.e. 1 in 65,536, random 8-mers) is relatively more infrequent than one composed of, for example, 6 nucleotides (e.g. the Sma I site or CCC/GGG, having an estimated relative occurrence of 1 in 4⁶, i.e. 1 in 4,096, random 6-mers), which in turn is relatively more infrequent than one composed of, for example, 4 nucleotides (e.g. the Msp I site or C/CGG, having an estimated relative occurrence of 1 in 4⁴, i.e. 1 in 256, random 4-mers). Moreover, in another non-limiting exemplification, it is appreciated that, in general, a restriction site having no ambiguous (but only specific) base requirements (e.g. the Fin I site or GTCCC, having an estimated relative occurrence of 1 in 4⁵, i.e. 1 in 1,024, random 5-mers) is relatively more infrequent than one having an ambiguous W (where W=A or T) base requirement (e.g. the Ava II site or G/GWCC, having an estimated relative occurrence of 1 in 4×4×2×4×4, i.e. 1 in 512, random 5-mers), which in turn is relatively more infrequent than one having an ambiguous N (where N=A or C or G or T) base requirement (e.g. the Asu I site or G/GNCC, having an estimated relative occurrence of 1 in 4×4×1×4×4, i.e. 1 in 256, random 5-mers). These relative occurrences are considered general estimates for actual polynucleotides, because it is appreciated that specific nucleotide bases (not to mention specific nucleotide sequences) occur with dissimilar frequencies in specific polynucleotides, in specific species of organisms, and in specific groupings of organisms. For example, it is appreciated that the % G+C contents of different species of organisms are often very different and wide ranging.
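The estimates above can be reproduced with a short calculation. The following is a minimal sketch, assuming that the four bases occur independently and with equal frequency (an assumption that, as noted, does not hold for actual polynucleotides); the function name and the lookup table are illustrative only.

```python
from fractions import Fraction

# Number of bases (out of 4) accepted by each IUPAC nucleotide code.
IUPAC_OPTIONS = {
    "A": 1, "C": 1, "G": 1, "T": 1,
    "R": 2, "Y": 2, "S": 2, "W": 2, "K": 2, "M": 2,
    "B": 3, "D": 3, "H": 3, "V": 3,
    "N": 4,
}


def expected_spacing(site):
    """Return X such that the site is expected once in X random n-mers.

    Assumes equal, independent base frequencies (illustrative only).
    """
    probability = Fraction(1)
    for base in site.upper():
        probability *= Fraction(IUPAC_OPTIONS[base], 4)
    return 1 / probability


for name, site in [("Not I", "GCGGCCGC"), ("Sma I", "CCCGGG"), ("Msp I", "CCGG"),
                   ("Fin I", "GTCCC"), ("Ava II", "GGWCC"), ("Asu I", "GGNCC")]:
    print(f"{name}: 1 in {expected_spacing(site)}")
```

A fraction is used rather than a floating-point value so that codes accepting three of the four bases (B, D, H, V) are handled exactly.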

The use of relatively more infrequent restriction sites as a selection marker includes, in a non-limiting fashion, preferably those sites composed of at least a 4 nucleotide sequence, more preferably those composed of at least a 5 nucleotide sequence, more preferably still those composed of at least a 6 nucleotide sequence (e.g. the BamH I site or G/GATCC, the Bgl II site or A/GATCT, the Pst I site or CTGCA/G, and the Xba I site or T/CTAGA), more preferably still those composed of at least a 7 nucleotide sequence, more preferably still those composed of an 8 nucleotide sequence (e.g. the Asc I site or GG/CGCGCC, the Not I site or GC/GGCCGC, the Pac I site or TTAAT/TAA, the Pme I site or GTTT/AAAC, the Srf I site or GCCC/GGGC, the Sse8387 I site or CCTGCA/GG, and the Swa I site or ATTT/AAAT), more preferably still those composed of a 9 nucleotide sequence, and even more preferably still those composed of at least a 10 nucleotide sequence (e.g. the BspG I site or CG/CGCTGGAC). It is further appreciated that some restriction sites (e.g. for class IIS enzymes) are comprised of a portion of relatively high specificity (i.e. a portion containing a principal determinant of the frequency of occurrence of the restriction site) and a portion of relatively low specificity; and that a site of cleavage may or may not be contained within a portion of relatively low specificity. For example, in the Eco57 I site or CTGAAG(16/14), there is a portion of relatively high specificity (i.e. the CTGAAG portion) and a portion of relatively low specificity (i.e. the N₁₆ sequence) that contains a site of cleavage.

In another preferred embodiment of this invention, a serviceable end-selection marker is a terminal sequence that is recognized by a polynucleotide-acting enzyme that recognizes a specific polynucleotide sequence. In a preferred aspect of this invention, serviceable polynucleotide-acting enzymes also include other enzymes in addition to classic type II restriction enzymes. According to this preferred aspect of this invention, serviceable polynucleotide-acting enzymes also include gyrases, helicases, recombinases, relaxases, and any enzymes related thereto.

Among preferred examples are topoisomerases (which have been categorized by some as a subset of the gyrases) and any other enzymes that have polynucleotide-cleaving activity (including preferably polynucleotide-nicking activity) &/or polynucleotide-ligating activity. Among preferred topoisomerase enzymes are topoisomerase I enzymes, which are available from many commercial sources (Epicentre Technologies, Madison, Wis.; Invitrogen, Carlsbad, Calif.; Life Technologies, Gaithersburg, Md.) and conceivably even more private sources. It is appreciated that similar enzymes may be developed in the future that are serviceable for end-selection as provided herein. A particularly preferred topoisomerase I enzyme is a topoisomerase I enzyme of vaccinia virus origin, that has a specific recognition sequence (e.g. 5′ . . . AAGGG . . . 3′) and has both polynucleotide-nicking activity and polynucleotide-ligating activity. Due to the specific nicking-activity of this enzyme (cleavage of one strand), internal recognition sites are not prone to polynucleotide destruction resulting from the nicking activity (but rather remain annealed) at a temperature that causes denaturation of a terminal site that has been nicked. Thus for use in end-selection, it is preferable that a nicking site for topoisomerase-based end-selection be no more than 100 nucleotides from a terminus, more preferably no more than 50 nucleotides from a terminus, more preferably still no more than 25 nucleotides from a terminus, even more preferably still no more than 20 nucleotides from a terminus, even more preferably still no more than 15 nucleotides from a terminus, even more preferably still no more than 10 nucleotides from a terminus, even more preferably still no more than 8 nucleotides from a terminus, even more preferably still no more than 6 nucleotides from a terminus, and even more preferably still no more than 4 nucleotides from a terminus.

In a particularly preferred exemplification that is non-limiting yet clearly illustrative, it is appreciated that when a nicking site for topoisomerase-based end-selection is 4 nucleotides from a terminus, nicking produces a single stranded oligo of 4 bases (in a terminal region) that can be denatured from its complementary strand in an end-selectable polynucleotide; this provides a sticky end (comprised of 4 bases) in a polynucleotide that is serviceable for an ensuing ligation reaction. To accomplish ligation to a cloning vector (preferably an expression vector), compatible sticky ends can be generated in a cloning vector by any means including by restriction enzyme-based means. The terminal nucleotides (comprised of 4 terminal bases in this specific example) in an end-selectable polynucleotide terminus are thus wisely chosen to provide compatibility with a sticky end generated in a cloning vector to which the polynucleotide is to be ligated.

On the other hand, internal nicking of an end-selectable polynucleotide, e.g. 500 bases from a terminus, produces a single stranded oligo of 500 bases that is not easily denatured from its complementary strand, but rather is serviceable for repair (e.g. by the same topoisomerase enzyme that produced the nick).

This invention thus provides a method (e.g. one that is vaccinia topoisomerase-based &/or type II (or IIS) restriction endonuclease-based &/or type III restriction endonuclease-based &/or nicking enzyme-based, e.g. using N. BstNB I) for producing a sticky end in a working polynucleotide, which end is ligation compatible, and which end can be comprised of at least a 1-base overhang. Preferably such a sticky end is comprised of at least a 2-base overhang, more preferably such a sticky end is comprised of at least a 3-base overhang, more preferably still such a sticky end is comprised of at least a 4-base overhang, even more preferably still such a sticky end is comprised of at least a 5-base overhang, even more preferably still such a sticky end is comprised of at least a 6-base overhang. Such a sticky end may also be comprised of at least a 7-base overhang, or at least an 8-base overhang, or at least a 9-base overhang, or at least a 10-base overhang, or at least a 15-base overhang, or at least a 20-base overhang, or at least a 25-base overhang, or at least a 30-base overhang. These overhangs can be comprised of any bases, including A, C, G, or T.

It is appreciated that sticky end overhangs introduced using topoisomerase or a nicking enzyme (e.g. using N. BstNB I) can be designed to be unique in a ligation environment, so as to prevent unwanted fragment reassemblies, such as self-dimerizations and other unwanted concatamerizations.

According to one aspect of this invention, a plurality of sequences (which may but do not necessarily overlap) can be introduced into a terminal region of an end-selectable polynucleotide by the use of an oligo in a polymerase-based reaction. In a relevant, but by no means limiting example, such an oligo can be used to provide a preferred 5′ terminal region that is serviceable for topoisomerase I-based end-selection, which oligo is comprised of: a 1-10 base sequence that is convertible into a sticky end (preferably by a vaccinia topoisomerase I), a ribosome binding site (i.e. an “RBS”, that is preferably serviceable for expression cloning), and an optional linker sequence followed by an ATG start site and a template-specific sequence of 0-100 bases (to facilitate annealment to the template in the polymerase-based reaction). Thus, according to this example, a serviceable oligo (which may be termed a forward primer) can have the sequence: 5′ [terminal sequence = (N)₁₋₁₀][topoisomerase I site & RBS = AAGGGAGGAG][linker = (N)₁₋₁₀₀][start codon and template-specific sequence = ATG(N)₀₋₁₀₀] 3′.

Analogously, in a relevant, but by no means limiting example, an oligo can be used to provide a preferred 3′ terminal region that is serviceable for topoisomerase I-based end-selection, which oligo is comprised of: a 1-10 base sequence that is convertible into a sticky end (preferably by a vaccinia topoisomerase I), and an optional linker sequence followed by a template-specific sequence of 0-100 bases (to facilitate annealment to the template in the polymerase-based reaction). Thus, according to this example, a serviceable oligo (which may be termed a reverse primer) can have the sequence: 5′ [terminal sequence = (N)₁₋₁₀][topoisomerase I site = AAGGG][linker = (N)₁₋₁₀₀][template-specific sequence = (N)₀₋₁₀₀] 3′.
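The forward and reverse primer layouts just described can be assembled programmatically. The following is a minimal sketch only: the function names are hypothetical, the length limits are taken from the layouts above, and the linker is allowed to be empty on the assumption that "optional" permits a zero-length linker.

```python
ALLOWED_BASES = set("ACGTN")


def _checked(name, seq, min_len, max_len):
    """Upper-case a segment and verify its alphabet and length range."""
    seq = seq.upper()
    if not set(seq) <= ALLOWED_BASES:
        raise ValueError(f"{name} contains characters other than A, C, G, T, N")
    if not min_len <= len(seq) <= max_len:
        raise ValueError(f"{name} must be {min_len}-{max_len} bases long")
    return seq


def forward_primer(terminal, linker, template_specific, topo_rbs="AAGGGAGGAG"):
    """5' [terminal][topoisomerase I site & RBS][linker][ATG + template-specific] 3'."""
    return (_checked("terminal", terminal, 1, 10) + topo_rbs
            + _checked("linker", linker, 0, 100)  # assumption: linker may be empty
            + "ATG" + _checked("template_specific", template_specific, 0, 100))


def reverse_primer(terminal, linker, template_specific, topo_site="AAGGG"):
    """5' [terminal][topoisomerase I site][linker][template-specific] 3'."""
    return (_checked("terminal", terminal, 1, 10) + topo_site
            + _checked("linker", linker, 0, 100)  # assumption: linker may be empty
            + _checked("template_specific", template_specific, 0, 100))


fwd = forward_primer("CTAG", "AAAACC", "")
rev = reverse_primer("GATC", "", "ACGTACGTAC")  # hypothetical template-specific bases
print(fwd)
print(rev)
```

With a CTAG terminal sequence, an AAAACC linker, and no template-specific bases, the forward layout yields CTAGAAGGGAGGAGAAAACCATG; the segment boundaries chosen here are one possible reading and are not asserted to be the only one.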

It is appreciated that end-selection can be used to distinguish and separate parental template molecules (e.g. to be subjected to mutagenesis) from progeny molecules (e.g. generated by mutagenesis). For example, a first set of primers, lacking in a topoisomerase I recognition site, can be used to modify the terminal regions of the parental molecules (e.g. in polymerase-based amplification). A different second set of primers (e.g. having a topoisomerase I recognition site) can then be used to generate mutated progeny molecules (e.g. using any polynucleotide chimerization method, such as interrupted synthesis or template-switching polymerase-based amplification; or using saturation mutagenesis; or using any other method for introducing a topoisomerase I recognition site into a mutagenized progeny molecule as disclosed herein) from the amplified template molecules. The use of topoisomerase I-based end-selection can then facilitate, not only discernment, but selective topoisomerase I-based ligation of the desired progeny molecules. Annealment of a second set of primers to thusly amplified parental molecules can be facilitated by including sequences in a first set of primers (i.e. primers used for amplifying a set of parental molecules) that are similar to a topoisomerase I recognition site, yet different enough to prevent functional topoisomerase I enzyme recognition. For example, sequences that diverge from the AAGGG site by anywhere from 1 base to all 5 bases can be incorporated into a first set of primers (to be used for amplifying the parental templates prior to subjection to mutagenesis). In a specific, but non-limiting aspect, it is thus provided that a parental molecule can be amplified using the following exemplary, but by no means limiting, set of forward and reverse primers: Forward Primer: 5′ CTAGAAGAGAGGAGAAAACCATG(N)₁₀₋₁₀₀ 3′, and Reverse Primer: 5′ GATCAAAGGCGCGCCTGCAGG(N)₁₀₋₁₀₀ 3′

According to this specific example of a first set of primers, (N)₁₀₋₁₀₀ represents preferably a 10 to 100 nucleotide-long template-specific sequence, more preferably a 10 to 50 nucleotide-long template-specific sequence, more preferably still a 10 to 30 nucleotide-long template-specific sequence, and even more preferably still a 15 to 25 nucleotide-long template-specific sequence.

According to a specific, but non-limiting aspect, it is thus provided that, after this amplification (using a disclosed first set of primers lacking in a true topoisomerase I recognition site), amplified parental molecules can then be subjected to mutagenesis using one or more sets of forward and reverse primers that do have a true topoisomerase I recognition site. In a specific, but non-limiting aspect, it is thus provided that a parental molecule can be used as a template for the generation of a mutagenized progeny molecule using the following exemplary, but by no means limiting, second set of forward and reverse primers: Forward Primer: 5′ CTAGAAGGGAGGAGAAAACCATG 3′ Reverse Primer: 5′ GATCAAAGGCGCGCCTGCAGG 3′ (contains Asc I recognition sequence)
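The distinction between the first and second primer sets can be checked by scanning the fixed 5′ portions of the exemplary primers for the AAGGG and Asc I (GGCGCGCC) sequences. The following is a minimal sketch; the dictionary keys and function names are illustrative, only the strand as printed is searched, and the (N)₁₀₋₁₀₀ template-specific portions of the first set are omitted.

```python
TOPO_I_SITE = "AAGGG"
ASC_I_SITE = "GGCGCGCC"

# Fixed 5' portions of the exemplary primers, as printed in the text.
PRIMERS = {
    "first_forward": "CTAGAAGAGAGGAGAAAACCATG",
    "first_reverse": "GATCAAAGGCGCGCCTGCAGG",
    "second_forward": "CTAGAAGGGAGGAGAAAACCATG",
    "second_reverse": "GATCAAAGGCGCGCCTGCAGG",
}


def site_report(primer):
    """Report which recognition sequences appear on the printed strand."""
    return {"topo_I": TOPO_I_SITE in primer, "Asc_I": ASC_I_SITE in primer}


def mismatches(a, b):
    """Count positions at which two equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x != y for x, y in zip(a, b))


report = {name: site_report(seq) for name, seq in PRIMERS.items()}
forward_divergence = mismatches(PRIMERS["first_forward"], PRIMERS["second_forward"])

for name, result in report.items():
    print(name, result)
print("forward primers differ at", forward_divergence, "position(s)")
```

On the strand as printed, only the second forward primer carries AAGGG, and it differs from the first forward primer at a single base (AAGAG versus AAGGG), consistent with a first-set sequence that is similar to, yet distinct from, the recognition site.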

It is appreciated that any number of different primer sets not specifically mentioned can be used as first, second, or subsequent sets of primers for end-selection consistent with this invention. Notice that type II restriction enzyme sites can be incorporated (e.g. an Asc I site in the above example). It is provided that, in addition to the other sequences mentioned, the experimentalist can incorporate one or more N,N,G/T triplets into a serviceable primer in order to subject a working polynucleotide to saturation mutagenesis. Summarily, use of a second and/or subsequent set of primers can achieve dual goals of introducing a topoisomerase I site and of generating mutations in a progeny polynucleotide.
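The coverage provided by an N,N,G/T triplet can be enumerated directly. The following is a minimal sketch that assumes the standard genetic code; the variable names are illustrative.

```python
from itertools import product

# Standard genetic code, codons ordered with bases T, C, A, G at each position.
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# N,N,G/T: any base at positions 1 and 2, G or T at position 3.
nnk_codons = [a + b + c for a in "ACGT" for b in "ACGT" for c in "GT"]

encoded_amino_acids = sorted({CODON_TABLE[c] for c in nnk_codons} - {"*"})
stop_codons = [c for c in nnk_codons if CODON_TABLE[c] == "*"]

print(len(nnk_codons), "codons")
print(len(encoded_amino_acids), "amino acids:", "".join(encoded_amino_acids))
print("stop codons:", stop_codons)
```

Under the standard code, the 32 possible N,N,G/T triplets encode all 20 amino acids while including only one stop codon (TAG), which is why such a degenerate triplet is serviceable for saturation mutagenesis at a codon position.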

Thus, according to one use provided, a serviceable end-selection marker is an enzyme recognition site that allows an enzyme to cleave (including nick) a polynucleotide at a specified site, to produce a ligation-compatible end upon denaturation of a generated single stranded oligo. Ligation of the produced polynucleotide end can then be accomplished by the same enzyme (e.g. in the case of vaccinia virus topoisomerase I), or alternatively with the use of a different enzyme. According to one aspect of this invention, any serviceable end-selection markers, whether like (e.g. two vaccinia virus topoisomerase I recognition sites) or unlike (e.g. a class II restriction enzyme recognition site and a vaccinia virus topoisomerase I recognition site) can be used in combination to select a polynucleotide. Each selectable polynucleotide can thus have one or more end-selection markers, and they can be like or unlike end-selection markers. In a particular aspect, a plurality of end-selection markers can be located on one end of a polynucleotide and can have overlapping sequences with each other.

It is important to emphasize that any number of enzymes, whether currently in existence or to be developed, can be serviceable in end-selection according to this invention. For example, in a particular aspect of this invention, a nicking enzyme (e.g. N. BstNB I, which cleaves only one strand at 5′ . . . GAGTCNNNN/N . . . 3′) can be used in conjunction with a source of polynucleotide-ligating activity in order to achieve end-selection. According to this embodiment, a recognition site for N. BstNB I, instead of a recognition site for topoisomerase I, should be incorporated into an end-selectable polynucleotide (whether end-selection is used for selection of a mutagenized progeny molecule or whether end-selection is used apart from any mutagenesis procedure).

It is appreciated that the instantly disclosed end-selection approach using topoisomerase-based nicking and ligation has several advantages over previously available selection methods. In sum, this approach allows one to achieve directional cloning (including expression cloning). Specifically, this approach can be used for the achievement of: direct ligation (i.e. without subjection to a classic restriction-purification-ligation reaction, that is susceptible to a multitude of potential problems from an initial restriction reaction to a ligation reaction dependent on the use of T4 DNA ligase); separation of progeny molecules from original template molecules (e.g. original template molecules lack topoisomerase I sites that are not introduced until after mutagenesis); obviation of the need for size separation steps (e.g. by gel chromatography or by other electrophoretic means or by the use of size-exclusion membranes); preservation of internal sequences (even when topoisomerase I sites are present); obviation of concerns about unsuccessful ligation reactions (e.g. dependent on the use of T4 DNA ligase, particularly in the presence of unwanted residual restriction enzyme activity); and facilitated expression cloning (including obviation of frame shift concerns). Concerns about unwanted restriction enzyme-based cleavages, especially at internal restriction sites (or even at often unpredictable sites of unwanted star activity) in a working polynucleotide, which are potential sites of destruction of a working polynucleotide, can also be obviated by the instantly disclosed end-selection approach using topoisomerase-based nicking and ligation.

3.3 Tunable

3.4 Transposons

3.4.1. General Applications

In one aspect, the present invention relates generally to the field of transposable nucleic acids and to methods for introducing genetic changes into nucleic acid. In one embodiment this invention relates to transposable elements isolated from maize and a process for using the same to identify and isolate genes and to insert desired gene sequences into plants in a heritable manner. In another embodiment, this invention provides for using transposons as a high molecular weight cloning system.

3.4.2. Specific Methodologies

3.4.2.1. Description of Transposable Elements

Transposable genetic elements are DNA sequences, found in a wide variety of prokaryotic and eukaryotic organisms, that can move or transpose from one position to another position in a genome. In vivo, intra-chromosomal transpositions as well as transpositions between chromosomal and non-chromosomal genetic material are known. In several systems, transposition is known to be under the control of a transposase enzyme that is typically encoded by the transposable element. The genetic structures and transposition mechanisms of various transposable elements are summarized, for example, in “Transposable Genetic Elements” in “The Encyclopedia of Molecular Biology,” Kendrew and Lawrence, Eds., Blackwell Science, Ltd., Oxford (1994), incorporated herein by reference.

Scientists have taken advantage of transposons to transport reporter genes for use in studying gene expression. These include transcriptional (Type I) fusions and translational (Type II) fusions. Transcriptional fusions, unlike translational fusions, place a reporter gene under the control of another promoter, but do not translationally fuse two protein domains. Translational fusions have generally been made to link a reporter gene carried inside the transposon to the translational frame of the target gene so that the reporter gene is expressed under direct control of the transcription and translation signals of the target gene of interest to study gene regulation. This requires that an open reading frame extend through the end of the transposable element to join an internal reporter protein to external translational sequences. This usually results in complete inactivation of the target gene.

3.4.2.2. In Vitro Transposition Systems

In vitro transposition systems that utilize the particular transposable elements of bacteriophage Mu and bacterial transposon Tn10 have been described by the research groups of Kiyoshi Mizuuchi and Nancy Kleckner, respectively.

The bacteriophage Mu system was first described by Mizuuchi, K., “In Vitro Transposition of Bacteriophage Mu: A Biochemical Approach to a Novel Replication Reaction,” Cell:785-794 (1983) and Craigie, R. et al., “A Defined System for the DNA Strand-Transfer Reaction at the Initiation of Bacteriophage Mu Transposition: Protein and DNA Substrate Requirements,” P.N.A.S. U.S.A. 82: 7570-7574 (1985). The DNA donor substrate (mini-Mu) for the Mu in vitro reaction normally requires six Mu transposase binding sites (three of about 30 bp at each end) and an enhancer sequence located about 1 kb from the left end. The donor plasmid must be supercoiled. Proteins required are Mu-encoded A and B proteins and host-encoded HU and IHF proteins. Lavoie, B. D. and G. Chaconas, “Transposition of phage Mu DNA,” Curr. Topics Microbiol. Immunol. 204: 83-99 (1995). The Mu-based system is disfavored for in vitro transposition system applications because the Mu termini are complex and sophisticated and because transposition requires additional proteins above and beyond the transposase.

The Tn10 system was described by Morisato, D. and N. Kleckner, “Tn10 Transposition and Circle Formation in vitro,” Cell 51: 101-111 (1987) and by Benjamin, H. W. and N. Kleckner, “Excision Of Tn10 from the Donor Site During Transposition Occurs By Flush Double-Strand Cleavages at the Transposon Termini,” P.N.A.S. U.S.A. 89: 4648-4652 (1992). The Tn10 system involves a supercoiled circular DNA molecule carrying the transposable element (or a linear DNA molecule plus E. coli IHF protein). The transposable element is defined by complex 42 bp terminal sequences with an IHF binding site adjacent to the inverted repeat. In fact, even longer (81 bp) ends of Tn10 were used in reported experiments. Sakai, J. et al., “Identification and Characterization of Pre-cleavage Synaptic Complex that is an Early Intermediate in Tn10 transposition,” E.M.B.O. J. 14: 4374-4383 (1995). In the Tn10 system, chemical treatment of the transposase protein is essential to support active transposition. In addition, the termini of the Tn10 element limit its utility in a generalized in vitro transposition system.

Both the Mu- and Tn10-based in vitro transposition systems are further limited in that they are active only on covalently closed circular, supercoiled DNA targets. What is desired is a more broadly applicable in vitro transposition system that utilizes shorter, more well defined termini and which is active on target DNA of any structure (linear, relaxed circular, and supercoiled circular DNA).

According to alternative embodiments of this invention, the steps of introducing a plurality of traits and/or generating a set of mutagenized organisms may include the step of cloning. In a preferred embodiment, this invention provides that the step of cloning may comprise using a Tn7 transposon-based system, such as but not limited to GPS-1. GPS-1 is an in vitro system (New England BioLabs Inc., Catalog #E7100S) that uses TnsABC Transposase to insert a transposon (Transprimer™) randomly into the DNA target (See references Craig, N. L. (1996) Curr Top Microbiol Immunol 204, 27-48; Stellwagen, A. E. and Craig, N. L. (1997) Genetics 145, 573-85; Biery, M. C., Stewart, F. J., Stellwagen, A. E., Raleigh, E. A. and Craig, N. L., (2000) Nucleic Acids Res 28, 1067-1077). Such a system or modifications thereof that take advantage of transposon insertion sequences can be utilized to aid in the cloning of high molecular weight DNA. Such cloning approaches may also be provided for by this invention as a step in cell screening.

3.4.2.3. Importance of Transposons in Agriculture

Currently, there is a great deal of interest in the development of gene transfer vectors for use with agriculturally important plants (See Outlook for Science and Technology, The Next Five Years, Vol. III, National Science Foundation (1982); and O.T.A. Report, Impact of Applied Genetics (1981)).

Although the United States presently has an excess productivity in the agricultural sector, this is recognized as a local and short term condition. Thus, agricultural research and planning must be based on long term considerations. The variety of problems surrounding increases in population, degradation of prime farm land and decreasing availability of good farm land necessitate the increased use of marginal land, as well as exogenous fertilizers and chemical pest control compositions.

Classical plant breeding programs have thus far been successful in increasing agricultural productivity. However, a substantial fraction of the increase in farm productivity experienced in the United States in the past 40 years is attributable to the use of fertilizers and modern energy intensive cultivation practices, both of which are increasingly costly. The ability of plant breeding alone to sustain productivity is a matter of some question. Plant breeders are divided in their views on whether genetic improvements will continue at the rate that has occurred over the past few decades or will begin to level out. Since such questions cannot be resolved a priori, it is prudent to explore a variety of additional means by which agronomically useful traits can be accumulated and improved in major crop plants. The unconventional areas that are presently receiving the most attention in the academic research establishment, as well as in both small and large firms with plant-oriented research programs, are wide genetic crosses, tissue culture and the development of gene transfer systems that circumvent fertility barriers.

In the past, many attempts have been made to transform plant cells with DNA from a variety of sources. The first unequivocal demonstration that DNA transfer can and does occur in plants emerged from the work described above on Agrobacterium tumefaciens Ti plasmid. However, Ti-plasmid mediated gene transfer is presently accomplished only in dicotyledonous plants that interact with the plasmid's natural host bacterium. Since most major crop species are monocotyledonous, Ti-plasmid mediated gene transfer has limited applications.

3.4.2.3.1. Use of Transposons on the Ti Plasmid of Agrobacterium

In higher organisms, transposons have been, or are being, used in several ways. For example, transposons are used as mutagens on the Ti plasmid of Agrobacterium tumefaciens. That is, a method for using bacterial transposons to cause insertion mutations in the Agrobacterium tumefaciens Ti plasmid, the causative agent of crown gall disease in dicotyledonous plants, has been developed. (See Zambryski, P., Goodman, H., Van Montagu, M. and Schell, J., Mobile Genetic Elements, J. Shapiro, Ed., (Academic Press) New York, pp. 506-535 (1983)). By this technique, it has become possible to identify the plasmid-borne genes that are responsible for virulence, as well as those that are responsible for the tumorous transformation of plant cells caused by the Ti plasmid. Further, it has become possible to show, by using transposable elements, that a portion of the Ti plasmid can be integrated into plant genomes and can act as a vehicle for transferring genes from virtually any organism to any dicotyledonous plant that is susceptible to Agrobacterium tumefaciens.

3.4.2.3.2. Use of Transposons in Maize

In maize, a monocotyledon, transposable elements were first genetically identified in the mid-1940s. These elements have been studied extensively and their genetic behavior has been extensively reviewed (See McClintock, B., Cold Spring Harbor Symp. Quant. Biol. 16: 13-47 (1951); McClintock, B., Cold Spring Harbor Symp. Quant. Biol. 21: 197-216 (1956); McClintock, B., Brookhaven Symp. Biol. 18: 162-184 (1965); Fincham, J. R. S., and Sastry, G. R. K., Ann. Rev. Genet. 8: 15-50 (1974); and Fedoroff, N., Mobile Genetic Elements, J. Shapiro, Ed., (Academic Press) New York, pp. 1-63 (1983)).

It has been demonstrated that transposons are normal, although cryptic, residents of the maize genome and that upon activation, they are responsible for various types of genetic rearrangements, including chromosome breakage, deletions, duplications, inversions and translocations. In addition, it has been shown that certain common types of unstable mutations, which have been studied for decades in both maize and in other organisms, are attributable to the insertion of transposons into genes or genetic loci.

3.4.2.4.1. Type I Fusions

Type I transcriptional fusions have been used to study gene expression and regulation by co-opting the native transcriptional signal to express the exogenous reporter gene. Examples include studies of gene expression in E. coli and yeast and of Drosophila development.

3.4.2.4.2. Type II Fusions

Type II fusions have also been used to study gene expression and regulation, but in this case they co-opt not only the transcriptional signals but also any translational signals to express the reporter gene. In this type of system the protein product usually expresses only the activity of the exogenous reporter gene.

MudII elements are mini-Mu deletion elements which are type II Mu transposable elements. Examples of these include beta-galactosidase fusion elements, where a beta-galactosidase (lacZ) reporter gene is inserted via transposable elements to detect transcription and translation of regulated gene systems. This usually results in the inactivation of the targeted gene.

Two types of Mu protein fusions have been developed, lacZ fusion elements and nptI fusion elements (Symonds, Toussaint et al. (1987). Phage Mu). The lacZ elements have been used to study translation regulation, determine the translation phase of target genes, infer the location of a protein fusion by hybrid protein size, determine amino terminal sequence, and raise antibodies to regions of the protein of interest. By far the major goal of these studies has been to determine mechanisms of gene expression in the studied organisms.

The nptI system was designed to perform transposon tagging since nptI is known to function as an aminoglycoside resistance gene in a variety of organisms. Transposon tagging is a method of creating a mutant by inserting a transposon with a selectable marker into the gene of interest so that mutants which inactivate the gene can be identified and maintained. This element is useful since it allows the nptI to be directly linked to the transcription/translation system of the organism being studied. In these studies there has been no emphasis on creating novel proteins with new activities using these transposable elements. More importantly, these Mu elements are restricted to making amino-terminal fusions to the reporter protein. In these cases the inserted reporter gene is fused to the carboxy-end of the truncated targeted protein, terminating inside the Mu. If the transposable element were to insert before the amino terminal of a targeted gene, functional translation could only occur on the marker gene by itself, and no translation of the target gene would occur.

3.4.2.4.3. Problems with Mu

Unfortunately, available Mu elements had several problems. First, it has not been demonstrated that Mu elements can be readily used as a general method for the development of fusion proteins with two active domains. Second, the Mu elements used thus far for creation of protein fusions cannot be used for construction of “carboxy-terminal” fusions since they did not have an open reading frame extending into the element. Third, the Mu elements previously used have long linker regions which incorporate a 40 amino acid linker between the fused domains. This could create protein folding problems or unwanted domain interactions. Fourth, the currently existing Mu elements had only a single restriction site for the insertion of protein domains. Finally, although Mu elements which had deleted ends existed, it was not known whether they would transpose well with additional sequences added in such close proximity to the right end and whether the intervening linker region which would join the two protein domains would interfere with the construction of active chimeric proteins.

3.4.2.5. Other Transposons

Other transposons have been used in a similar manner as Mu to create lac fusions to study gene expression. These include Tn10 and Tn917 (Berg and Howe. (1989). Mobile DNA).

The Tn5 element has also been used to construct phoA fusions in vivo. Fusions with alkaline phosphatase (phoA) have also been used to probe the structure of membrane bound proteins (Lloyd and Kadner. (1990). J. Bacteriol. 172: 1688-93.). In general, these transposons have been used to study the membrane topology of a particular gene product and protein secretion. The resultant fusion proteins are also limited to amino-terminal fusion of the PhoA reporter protein, resulting in fusion at the carboxy end of the targeted gene.

In general, these types of fusions have been applied to the study of gene expression. These elements were constructed with truncated marker proteins that extend through the end of the transposon. Transposition of the element can create an in-frame fusion with a target gene, thereby activating expression. Mini-Mu elements are used because they transpose at high frequencies, insert randomly, and can be packaged along with a target plasmid and transduced to a new cell (Symonds, Toussaint et al. (1987). Phage Mu). Some of the more pertinent work that has been done in the area of transposable elements is detailed in the following.

Namgoong et al., (1994), teach that the Mu transposition reaction attachment sites attL and attR can promote the assembly of higher order complexes held together by non-covalent protein-DNA and protein-protein interactions. (Namgoong, Jayaram et al. (1994). J Mol. Biol. 238: 514-527.) Harel et al., (1990), teach that in Mu helper-mediated transposition packaging the left end contains an essential domain defined by nucleotides 1 to 54 of the left end (attL). At the right end (attR), they teach that the essential sequences for transposition require not more than the first 62 base pairs (bp), although the presence of sequences between 63 and 117 bp from the right end increases transposition frequency about 15-fold. (Harel, Dupliessis et al. (1990). Arch Microbiol. 154: 67-72.) Groenen and van de Putte (1986), teach that the Mu A protein binds weakly to sequences between nucleotides 1 to 30 on the right end (R1) and between nucleotides 110 and 135 on the left end (L2). Mutations in these weak A binding sites have a greater effect on transposition than mutations of corresponding base pairs in the stronger A binding sites, located adjacent to these weak A binding sites. (Groenen and van de Putte. (1986). J Mol. Biol. 189: 597-602.)

Groenen et al. (1985) teach the DNA sequences at the end of the genome of bacteriophage Mu that are essential for transposition. (Groenen, Timmers et al. (1985). Proc Natl Acad Sci, USA. 82: 2087-2091.)

Lloyd and Kadner teach how to probe the topology of the uhpT sugar phosphate transporter using a Tn5phoA element. (Lloyd and Kadner. (1990). J. Bacteriol. 172: 1688-93.)

Phage Mu (1987), Cold Spring Harbor Laboratory Press (Symonds, et al., eds.) teaches general methods for handling and working with bacteriophage Mu as a transposon, and describes the various uses of mini-Mu elements including the construction of Mu transcriptional and translational fusions.

Silhavy and Beckwith (1985) teach the various uses of lac fusions for the study of biological problems. (Silhavy and Beckwith. (1985). Microbiol Rev. 49: 398-418.)

Mobile DNA, (1989), American Society for Microbiology, Publishers. (Berg, Howe, eds) describes transposons.

Casadaban, et al. (1983) Methods in Enzymol, provides a good general review of beta-galactosidase gene fusions for the study of gene expression. (Casadaban, Martinez-Arias et al. (1983). Recombinant DNA. Methods in Enzymology. 100: 293-308.)

3.4.3. In Vitro Transposition System

The present invention is summarized in that an in vitro transposition system comprises a preparation of a suitably modified transposase of bacterial transposon Tn5, a donor DNA molecule that includes a transposable element, and a target DNA molecule into which the transposable element can transpose, all provided in a suitable reaction buffer.

3.4.3.1. Donor DNA Molecule: Transposable DNA Sequence of Interest

The transposable element of the donor DNA molecule is characterized as a transposable DNA sequence of interest, the DNA sequence of interest being flanked at its 5′- and 3′-ends by short repeat sequences that are acted upon in trans by Tn5 transposase.

3.4.3.1.1. Modified Transposase Enzyme Comprises Two Classes of Differences From Wild Type Tn5 Transposase

The invention is further summarized in that the suitably modified transposase enzyme comprises two classes of differences from wild type Tn5 transposase, where each class has a separate measurable effect upon the overall transposition activity of the enzyme and where a greater effect is observed when both modifications are present. The suitably modified enzyme both (1) binds to the repeat sequences of the donor DNA with greater avidity than wild type Tn5 transposase (“class (1) mutation”) and (2) is less likely than the wild type protein to assume an inactive multimeric form (“class (2) mutation”). A suitably modified Tn5 transposase of the present invention that contains both class (1) and class (2) modifications induces at least about 100-fold (±10%) more transposition than the wild type enzyme, when tested in combination in an in vivo conjugation assay as described by Weinreich, M. D., “Evidence that the cis Preference of the Tn5 Transposase is Caused by Nonproductive Multimerization,” Genes and Development 8: 2363-2374 (1994), incorporated herein by reference. Under optimal conditions, transposition using the modified transposase may be higher. A modified transposase containing only a class (1) mutation binds to the repeat sequences with sufficiently greater avidity than the wild type Tn5 transposase that such a Tn5 transposase induces about 5- to 50-fold more transposition than the wild type enzyme, when measured in vivo. A modified transposase containing only a class (2) mutation is sufficiently less likely than the wild type Tn5 transposase to assume the multimeric form that such a Tn5 transposase also induces about 5- to 50-fold more transposition than the wild type enzyme, when measured in vivo.

3.4.4. Transposons—Specialized Applications

3.4.4.1. In Vitro System for Introducing any Transposable Element from a Donor DNA into a Target DNA

It will be appreciated that this technique provides a simple, in vitro system for introducing any transposable element from a donor DNA into a target DNA. It is generally accepted and understood that Tn5 transposition requires only a pair of OE termini, located to either side of the transposable element. These OE termini are generally thought to be 18 or 19 bases in length and are inverted repeats relative to one another. (See Johnson, R. C., and W. S. Reznikoff, Nature 304: 280 (1983), incorporated herein by reference.) The Tn5 inverted repeat sequences, which are referred to as “termini” even though they need not be at the termini of the donor DNA molecule, are well known and understood.
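The flanking requirement described above can be illustrated with a minimal sketch. The function names are hypothetical, and the 19-base end sequence used in the example is an arbitrary placeholder chosen for illustration; it is not asserted to be the Tn5 OE sequence, which the caller would supply.

```python
# Minimal sketch: locate a transposable element flanked by an end sequence
# and its inverted repeat. All names here are illustrative assumptions.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def find_transposable_element(donor: str, end_seq: str):
    """Return (start, stop) so that donor[start:stop] begins with end_seq
    and ends with its inverted repeat, or None if no such pair exists."""
    donor = donor.upper()
    end_seq = end_seq.upper()
    left = donor.find(end_seq)
    if left == -1:
        return None
    right = donor.rfind(reverse_complement(end_seq))
    # The right-hand repeat must lie wholly after the left-hand one.
    if right == -1 or right < left + len(end_seq):
        return None
    return left, right + len(end_seq)


# Placeholder 19-base end sequence (NOT the Tn5 OE) and a toy payload.
PLACEHOLDER_END = "ACGTTGCAAGGCTTAACCG"
payload = "ATGAAACCCGGGTTTTAA"
donor = ("GGG" + PLACEHOLDER_END + payload
         + reverse_complement(PLACEHOLDER_END) + "CCC")
span = find_transposable_element(donor, PLACEHOLDER_END)
print(span, donor[span[0]:span[1]])
```

Because the repeats are searched for anywhere in the donor, the sketch also reflects the point that the "termini" need not lie at the physical ends of the donor DNA molecule.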

Apart from the need to flank the desired transposable element with standard Tn5 outside end (“OE”) termini, few other requirements on either the donor DNA or the target DNA are envisioned. It is thought that Tn5 has few, if any, preferences for insertion sites, so it is possible to use the system to introduce desired sequences at random into target DNA. Therefore, it is believed that this method, employing the modified transposase described herein and a simple donor DNA, is broadly applicable to introduce changes into any target DNA, without regard to its nucleotide sequence.

3.4.4.2. Generation of Functional Fusion Protein Products

The instant invention provides constructs and methods for the rapid and efficient generation of functional fusion protein products with either carboxy-terminal or amino-terminal fusions. Functional fusion proteins are those which retain some of the activity of the original domains, and/or those which have a newly created activity. Throughout this specification, reference is made to two types of fusions: carboxy terminal fusions and amino terminal fusions. In this text we use amino and carboxy terminal fusions to refer to the end of the domain inside of the Mu elements which is fused to the target molecule. Thus, carboxy terminal fusion elements are those with a protein domain inside of the Mu which extends out of the Mu element such that the exogenous protein is fused to the amino end of the endogenous protein. The amino terminal fusion elements are those that create fusions with a target gene extending into the element such that the exogenous protein is fused to the carboxy terminal of the endogenous protein (as referenced within U.S. Pat. Nos. 5,965,443 and 4,732,856).

3.4.5. Additional Applications

It is envisioned that in addition to the uses specifically noted herein, other applications will be apparent to the skilled molecular biologist. In particular, methods for introducing desired mutations into prokaryotic or eukaryotic DNA are very desirable. For example, at present it is difficult to knock out a functional eukaryotic gene by homologous recombination with an inactive version of the gene that resides on a plasmid. The difficulty arises from the need to flank the gene on the plasmid with extensive upstream and downstream sequences. Using this system, however, an inactivating transposable element containing a selectable marker gene (e.g., neo) can be introduced in vitro into a plasmid that contains the gene that one desires to inactivate. After transposition, the products can be introduced into suitable host cells. Using standard selection means, one can recover only cell colonies that contain a plasmid having the transposable element. Such plasmids can be screened, for example by restriction analysis, to recover those that contain a disrupted gene. Such clones can then be introduced directly into eukaryotic cells for homologous recombination and selection using the same marker gene.

Also, one can use the system to readily insert a PCR-amplified DNA fragment into a vector, thus avoiding traditional cloning steps entirely. This can be accomplished by (1) providing a suitable pair of PCR primers containing OE termini adjacent to the sequence-specific parts of the primers, (2) performing standard PCR amplification of a desired nucleic acid fragment, and (3) performing the in vitro transposition reaction of the present invention using the double-stranded products of PCR amplification as the donor DNA.
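As a rough illustration of step (1), the following sketch builds a primer pair in which an end sequence is placed 5′ of the sequence-specific part of each primer, and shows the top strand of the product expected after amplification. The function names, the annealing length of 18 bases, the template, and the end sequence are all hypothetical placeholders; the end sequence shown is not asserted to be the Tn5 OE sequence.

```python
# Minimal sketch: PCR primers carrying a transposon end sequence at the 5' end.
# Names, lengths, and sequences are illustrative assumptions only.
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of an uppercase DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def design_primers(template: str, end_seq: str, anneal_len: int = 18):
    """Return (forward, reverse) primers, each written 5'->3'.

    Each primer is the end sequence followed by the sequence-specific
    part that anneals to one end of the template.
    """
    template = template.upper()
    end_seq = end_seq.upper()
    if len(template) < anneal_len:
        raise ValueError("template shorter than annealing length")
    forward = end_seq + template[:anneal_len]
    reverse = end_seq + reverse_complement(template[-anneal_len:])
    return forward, reverse


def expected_amplicon(template: str, end_seq: str) -> str:
    """Top strand of the product: the fragment flanked by inverted repeats."""
    return end_seq.upper() + template.upper() + reverse_complement(end_seq.upper())


PLACEHOLDER_END = "ACGTTGCAAGGCTTAACCG"  # placeholder, not the Tn5 OE
template = "ATGGCTAGCAAAGGAGAAGAACTTTTCACTGGAGTTGTCCCAATTCTTGTTGAATTAGAT"
fwd, rev = design_primers(template, PLACEHOLDER_END)
amplicon = expected_amplicon(template, PLACEHOLDER_END)
print(fwd)
print(rev)
```

The resulting amplicon carries the end sequence on one side and its inverted repeat on the other, which is the arrangement a donor DNA needs for step (3).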

The invention is not intended to be limited to the foregoing examples, but to encompass all such modifications and variations as come within the scope of the appended claims.

3.5. Homologous Recombination

3.5.1. Homologous Recombination for the Generation of Deletions and Insertions

The invention relates to compositions and methods of rapidly evolving specific protein domains using a library of nucleic acid filaments and a recombinase polypeptide or peptide. The invention relates to compositions and methods for targeting sequence modifications in one or more genes of a related family of genes using enhanced homologous recombination techniques. The invention also relates to compositions and methods for isolating and identifying novel members of homologous sequence families. These techniques may be used to create animal or plant models of disease as well as to identify new targets for drug or pathogen screening.

3.5.1.1. Evolution of Genes

In nature, the evolution of genes and their encoded proteins occurs through an equilibrium between recombination or mutation and selection. While evolution in nature takes millions of years, in vitro methods and compositions have been developed to evolve proteins, with improved and novel functions, in a matter of hours to days.

3.5.1.1.1. Through Mutagenesis

Current in vitro gene evolution methods utilize repeated cycles of random mutagenesis or random nicking and mixing of related genes containing mutations in PCR-based random recombination. These methods couple multiple rounds of in vitro mutagenesis with screening systems to produce and identify the desired mutants or recombinants (Stemmer 1994. Nature 370: 389-391; Arnold 1996. Chemical Engineering Science 51: 5091-5102). Research has shown, however, that the mutations of interest tend to occur in those regions or domains that are directly related to function (Chen and Arnold. 1993. PNAS USA 90: 5618-5622).

However, these mutagenesis methods produce random mutations throughout the gene of interest, which requires screening large numbers of uninteresting or deleterious mutants. The labor-intensive and time consuming aspects of these methods are further complicated by the necessity of multiple rounds of subcloning, and the methods can be extremely challenging if the screening system is complex and does not utilize a selection system.

3.5.1.1.2. Homologous Recombination (HR)

Homologous recombination (HR) is defined as the exchange of homologous or similar DNA sequences between two DNA molecules. An essential feature of HR is that the enzymes responsible for the recombination event can pair any homologous sequences as substrates. The ability of HR to transfer genetic information between DNA molecules makes targeted homologous recombination a very powerful method in genetic engineering and gene manipulation. Both genetic and cytological studies have indicated that such a crossing-over process occurs between pairs of homologous chromosomes during meiosis in higher organisms.

3.5.1.1.3. Site-Specific Recombination

Alternatively, in site-specific recombination, exchange occurs at a specific site, as in the integration of phage lambda into the E. coli chromosome and the excision of lambda DNA from it. Site-specific recombination involves specific inverted repeat sequences; e.g. the Cre-loxP and FLP-FRT systems. Within these sequences only a short stretch of homology is necessary for the recombination event, but this homology is not sufficient for it. The enzymes involved in this event generally cannot recombine other pairs of homologous (or nonhomologous) sequences, but act specifically.

3.5.1.1.4. Advantage of Homologous Recombination over Site-Specific Recombination

Although both site-specific recombination and homologous recombination are useful mechanisms for genetic engineering of DNA sequences, targeted homologous recombination provides a basis for targeting and altering essentially any desired sequence in a duplex DNA molecule, such as targeting a DNA sequence in a chromosome for replacement by another sequence. Site-specific recombination has been proposed as one method to integrate transfected DNA at chromosomal locations having specific recognition sites (O'Gorman et al. (1991) Science 251: 1351; Onouchi et al. (1991) Nucleic Acids Res. 19: 6373). Unfortunately, since this approach requires the presence of specific target sequences and recombinases, its utility for targeting recombination events at any particular chromosomal location is severely limited in comparison to targeted general recombination.

3.5.1.1.5. HR to Create Transgenic Plants, Animals, and Organisms

Homologous recombination has also been used to create transgenic plants and animals. Transgenic organisms contain stably integrated copies of genes or gene constructs derived from another species in the chromosome of the transgenic organism. In addition, gene targeted animals can be generated by introducing cloned DNA constructs of the foreign genes into totipotent cells by a variety of methods, including homologous recombination. For example, animals that develop from genetically altered totipotent cells can contain the foreign gene in all somatic cells and also in germ-line cells.

3.5.1.1.5.1. Current Methods Using Embryonic Stem Cells

Current methods for producing transgenic and targeted animals are performed on totipotent embryonic stem (ES) cells and on fertilized zygotes. ES cells have an advantage in that large numbers of cells can be manipulated easily by homologous recombination in vitro before they are used to generate targeted animals. Currently, however, only embryonic stem cells from mice have been shown to contribute to the germ line. Alternatively, DNA can also be introduced into fertilized oocytes by micro-injection into pronuclei which are then transferred into the uterus of a pseudo-pregnant recipient animal to develop to term. The ability of mammalian and human cells to incorporate exogenous genetic material into genes residing on chromosomes has demonstrated that these cells have the general enzymatic machinery for carrying out homologous recombination required between resident and introduced sequences. These targeted recombination events can be used to correct mutations at known sites, replace genes or gene segments with defective ones, or introduce foreign genes into cells.

3.5.1.1.5.2. Frequency and Efficiency of HR

HR can be used to add subtle mutations at known sites, replace wild type genes or gene segments or introduce completely foreign genes into cells. However, HR efficiency is very low in living cells and is dependent on several parameters, including the method of DNA delivery, how it is packaged, its size and conformation, DNA length and position of sequences homologous to the target, and the efficiency of hybridization and recombination at chromosomal sites. These variables severely limit the use of conventional HR approaches for gene evolution in cell based systems. (Kucherlapati et al., 1984. PNAS USA 81: 3153-3157; Smithies et al. 1985. Nature 317: 230-234; Song et al. 1987. PNAS USA 84: 6820-6824; Doetschman et al. 1987. Nature 330: 576-578; Kim and Smithies. 1988. Nuc. Acids. Res. 16: 8887-8903; Koller and Smithies. 1989. PNAS USA 86: 8932-8935; Shesely et al. 1991. PNAS USA 88: 4294-4298; Kim et al. 1991. Gene 103: 227-233).

3.5.1.1.5.2.1. Enhancement by the Presence of Recombinase Activities

The frequency of HR is significantly enhanced by the presence of recombinase activities in cellular and cell free systems. Several proteins or purified extracts that promote HR (i.e., recombinase activity) have been identified in prokaryotes and eukaryotes (Cox and Lehman., 1987. Annu. Rev. Biochem. 56: 229-262; Radding. 1982. Annual Review of Genetics 16: 405-547; McCarthy et al. 1988. PNAS USA 85: 5854-5858). These recombinases promote one or more steps in the formation of homologously-paired intermediates, strand-exchange, and/or other steps. Recent advances have resulted in techniques allowing enhanced homologous recombination (EHR) using recombinases such as recA and Rad51 and single-stranded nucleic acids that have sequence heterologies. This allows sequence modifications to be specifically targeted to virtually any genomic position. See for example, PCT US93/03868 and PCT US98/05223, both of which are expressly incorporated herein by reference.

3.5.1.1.5.2.1.1. Recombinase RecA: A Bacterial Protein that Catalyses Homologous Pairing and Strand Exchange Between Two Homologous DNA Molecules

The most studied recombinase to date is the RecA recombinase of E. coli, which is involved in homology search and strand exchange reactions (Cox and Lehman, 1987, supra). The bacterial RecA protein (Mr 37,842) catalyses homologous pairing and strand exchange between two homologous DNA molecules (Kowalczykowski et al. 1994. Microbiol. Rev. 58: 401-465; West. 1992. Annu. Rev. Biochem. 61: 603-640; Roca and Cox. 1990. CRC Crit. Rev. Biochem. Mol. Biol.: 415-455; Radding. 1989. Biochim. Biophys. Acta. 1008: 131-145; Smith. 1989. Cell 58: 807-809).

RecA protein binds cooperatively to any given sequence of single-stranded DNA with a stoichiometry of one RecA protein monomer for every three to four nucleotides in DNA (Cox and Lehman, 1987, supra). This forms unique right handed helical nucleoprotein filaments in which the DNA is extended by 1.5 times its usual length (Yu and Egelman 1992. J. Mol. Biol. 227: 334-346). These nucleoprotein filaments, which are referred to as DNA probes, are crucial “homology search engines” which catalyze DNA pairing. Once the filament finds its homologous target gene sequence, the DNA probe strand invades the target and forms a hybrid DNA structure, referred to as a joint molecule or D-loop (DNA displacement loop) (McEntee et al. 1979. PNAS USA 76: 2615-2619; Shibata et al. 1979. PNAS USA 76: 1638-1642). The phosphate backbone of DNA inside the RecA nucleoprotein filaments is protected against digestion by phosphodiesterases and nucleases.
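The binding ratio and extension factor stated above lend themselves to a short worked calculation. The sketch below uses hypothetical function names and assumes, as a labeled assumption not taken from this text, a rise of 0.34 nm per nucleotide for unextended B-form DNA as the "usual length".

```python
import math

# Assumption (not from the text): rise per nucleotide of unextended
# B-form DNA, in nanometres, taken as the "usual length" per nucleotide.
RISE_PER_NT_NM = 0.34


def reca_monomer_range(n_nt: int):
    """Return (min, max) RecA monomers needed to coat n_nt nucleotides,
    at one monomer per four and one monomer per three nucleotides."""
    return math.ceil(n_nt / 4), math.ceil(n_nt / 3)


def filament_length_nm(n_nt: int, extension: float = 1.5) -> float:
    """Contour length of the coated DNA, extended 1.5-fold by default."""
    return n_nt * RISE_PER_NT_NM * extension


low, high = reca_monomer_range(300)
print(f"300-nt probe: {low} to {high} RecA monomers, "
      f"about {filament_length_nm(300):.0f} nm long")
```

Under these assumptions a 300-nucleotide probe would bind 75 to 100 monomers, which gives a sense of how much recombinase a coating reaction consumes per probe molecule.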

RecA protein is the prototype of a universal class of recombinase enzymes which promote probe-target pairing reactions. Recently, genes homologous to E. coli RecA (the Rad51 family of proteins) were isolated from all groups of eukaryotes, including yeast and humans. Rad51 protein promotes homologous pairing and strand invasion and exchange between homologous DNA molecules in a similar manner to RecA protein (Sung. 1994. Science 265: 1241-1243; Sung and Robberson. 1995. Cell 82: 453-461; Gupta et al. 1997. PNAS USA 94: 463-468; Baumann et al. 1996. Cell 87: 757-766).

3.5.1.1.5.3. Functional Genomics: the Correlation of Genotype and Phenotype

One area of pressing interest in biology is “functional genomics”, i.e., the correlation of genotype and phenotype. This requires animal systems, since phenotypic changes must be evaluated in vivo. Related to this idea is the elucidation and characterization of gene families, i.e., genes or proteins that are structurally related in that they have sequence homologies between the members of the family. Since presumably many, if not most, disease states are caused by multiple gene interactions, the ability to evaluate interactions among genes, and particularly within or between gene families, at the phenotype level, would be extremely valuable.

The functional genomics tools that allow facile identification and engineering of gene family members in animals and cells, however, are not yet available. While the amino acid sequence motifs shared between gene family members may be identical, the DNA sequence identity may be significantly lower because of degeneracy in the genetic code. Hence, one criterion necessary for genetic modifications of gene family members is development of homologous recombination technologies that can be used to clone and modify similar DNA sequences that share little sequence identity. This is particularly important since homologous recombination in cells normally requires significant sequence identity to work efficiently. Relaxing the amount of sequence identity needed for homologous recombination allows greater flexibility to target related genes for creating transgenic animals and cells containing modifications in gene family consensus sequences, and also will allow the rapid cloning, generation of gene family specific libraries, and evolution of gene family members. Accordingly, it is an object of the invention to provide an efficient method of domain specific gene evolution that generates maximal diversity but increases the probability of identifying a gene of interest.
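The effect of codon degeneracy on DNA sequence identity can be made concrete with a small sketch. It assumes the standard genetic code; the function names and the example peptides are hypothetical and chosen only for illustration.

```python
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def translate(dna: str) -> str:
    """Translate an in-frame DNA string whose length is a multiple of 3."""
    return "".join(CODON_TABLE[dna[i:i + 3]] for i in range(0, len(dna), 3))


def encodings(peptide: str):
    """List every DNA sequence that encodes the given peptide."""
    choices = [[c for c, aa in CODON_TABLE.items() if aa == res]
               for res in peptide]
    return ["".join(p) for p in product(*choices)]


def identity(a: str, b: str) -> float:
    """Fraction of positions at which two equal-length sequences match."""
    if len(a) != len(b):
        raise ValueError("sequences must have equal length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


# Two genes can encode the same Ser-Ser motif with no shared nucleotides.
print(len(encodings("LSR")), identity("TCTTCT", "AGCAGC"))
```

In this toy case the two DNA sequences encode an identical amino acid motif yet share no positions, which illustrates why a recombination method that tolerates low sequence identity is needed to reach all members of a gene family.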

3.5.2. Domain Specific Gene Evolution

3.5.2.1. Domain Specific Gene Evolution: Comprising Forming a Plurality of Recombination Intermediates Comprising a Target Nucleic Acid Encoding an Amino Acid Sequence of Interest, a Recombinase and a Plurality of Targeting Polynucleotides

The present invention provides methods of domain specific gene evolution comprising forming a plurality of recombination intermediates comprising a target nucleic acid encoding an amino acid sequence of interest, a recombinase and a plurality of targeting polynucleotides. The targeting polynucleotides are substantially complementary to each other and each comprises a homology clamp that substantially corresponds to or is substantially complementary to a predetermined sequence of the target nucleic acid and comprises random or degenerate sequences. The predetermined sequence encodes a domain of the amino acid sequence. The method further comprises contacting the intermediate with a recombination proficient cell, whereby a library of altered target nucleic acids is produced. The altered target nucleic acids are expressed in the cell to generate a pool of variant amino acid sequences. The method further comprises selecting and isolating a cell comprising an altered target nucleic acid that expresses a variant amino acid sequence having a desired activity.

3.5.2.2. Comprising Forming a Recombination Intermediate Comprising a Target Nucleic Acid Encoding an Amino Acid Sequence of Interest, a Recombinase and a Pair of Targeting Polynucleotides

In another aspect of the invention, a method of domain specific gene evolution comprises forming a recombination intermediate comprising a target nucleic acid encoding an amino acid sequence of interest, a recombinase and a pair of targeting polynucleotides. The targeting polynucleotides are substantially complementary to each other and each comprises a homology clamp that substantially corresponds to or is substantially complementary to a predetermined sequence of the target nucleic acid. The predetermined sequence encodes a domain of the amino acid sequence. The method further comprises contacting the intermediate with a single-strand specific nuclease or junction-specific nuclease to form a nicked or open-ended target nucleic acid. The regions adjacent to the hybridized region or junctions are susceptible to nucleases. The target nucleic acid is reassembled and recombined to produce a library of altered target nucleic acids. The target nucleic acids are expressed to generate a pool of variant amino acid sequences. The variant amino acid sequences are selected and characterized to identify an altered target nucleic acid encoding a variant amino acid sequence of interest.

In a further aspect, each method is repeated one or more times to further evolve a variant amino acid sequence having a desired activity. In yet another aspect, more than one domain of a protein is evolved simultaneously.

3.5.2.3. Compositions

It is an object of the present invention to provide compositions comprising at least one recombinase and at least two single-stranded targeting polynucleotides which are substantially complementary to each other and each having a consensus homology clamp for a gene family.

In an additional aspect, the invention provides compositions comprising at least one recombinase and a plurality of pairs of single stranded targeting polynucleotides, where the plurality of pairs comprises a set of degenerate probes encoding the consensus sequence.

In a further aspect, the invention provides kits comprising the compositions of the invention and at least one reagent.

3.5.2.4. Methods for Targeting a Sequence Modification in at Least One Member of a Consensus Family of Genes in a Cell by Homologous Recombination

In an additional aspect, the invention provides methods for targeting a sequence modification in at least one member of a consensus family of genes in a cell by homologous recombination. The method comprises introducing into at least one cell at least one recombinase and at least two single-stranded targeting polynucleotides which are substantially complementary to each other and each having a consensus homology clamp for the family. The method can additionally comprise identifying a target cell having a targeted sequence modification.

3.5.2.4.1. Methods of Making a Non-Human Organism with a Targeted Sequence Modification in at Least One Member of a Gene Family

In a further aspect, the invention provides methods of making a non-human organism with a targeted sequence modification in at least one member of a gene family. The method comprises introducing into a cell at least one recombinase and at least two single-stranded targeting polynucleotides which are substantially complementary to each other and each having a consensus homology clamp for said family. The cell is then subjected to conditions that result in the formation of an animal, and the animal has at least one modification in at least one member of a consensus family of genes.

In a further aspect, the invention provides non-human organisms containing a sequence modification in an endogenous consensus functional domain of a gene member of a gene family.

3.5.2.5. Methods of Isolating a Member of a Gene Family Comprising a Protein Consensus Sequence

In an additional aspect, the invention provides methods of isolating a member of a gene family comprising a protein consensus sequence. The method comprises adding to a complex mixture of nucleic acids at least one recombinase and at least two single-stranded targeting polynucleotides which are substantially complementary to each other and each having a consensus homology clamp for said family. At least one of the targeting polynucleotides comprises a purification tag. The method is carried out under conditions whereby the targeting polynucleotides form a complex with the member, and the family member is isolated using said purification tag. The complex nucleic acid mixture may be a cDNA library, a cell, RNA or a restriction endonuclease genomic digest.

3.5.3. Targeting a Predetermined Nucleic Acid Sequence that Encodes a Specific Protein Domain, to Make a Plurality of Targeted Sequence Modifications

The present invention provides methods and compositions for domain specific gene evolution. In one aspect of the invention, the method comprises targeting a predetermined nucleic acid sequence that encodes a specific protein domain, to make a plurality of targeted sequence modifications. That is, by targeting the recombinogenic probes of the invention to particular protein domains, gene evolution and selection are targeted to specific domains known or believed to harbor specific activities or functions. These methods create maximal diversity in specific domains of interest, thereby decreasing the size of the library of mutations that are to be screened and increasing the probability of finding a gene with improved or desired attributes. Therefore, the libraries of the present invention are enriched for advantageous or interesting mutations or recombinant sequence(s).

3.5.3.1. Combining A Plurality Of Pairs Of Single-Stranded Targeting Polynucleotides, A Predetermined Target Nucleic Acid, And A Recombinase To Form A Polynucleotide: Target Nucleic Acid Complex.

Accordingly, the methods comprise combining a plurality of pairs of single-stranded targeting polynucleotides, a predetermined target nucleic acid, and a recombinase to form a polynucleotide: target nucleic acid complex. The targeting polynucleotides comprise at least one homology clamp for targeting a predetermined domain of a target nucleic acid and randomized or degenerate sequences. The complex is optionally introduced into a plurality of recombination proficient cells which catalyze strand exchange and homologous recombination intracellularly to produce a library of modified nucleic acids. Cells are selected and isolated that comprise a modified nucleic acid that encodes a polypeptide having a desired property. The process is preferably repeated iteratively to further evolve the target domain of interest.

3.5.3.2. Domain Specific DNA Nicking

In another aspect of the invention, methods of domain specific DNA nicking are provided for domain specific gene evolution. This method comprises combining a pair of single-stranded targeting polynucleotides, a predetermined target nucleic acid, and a recombinase to form a polynucleotide: target nucleic acid complex. The targeting polynucleotides are substantially complementary and comprise at least one homology clamp for targeting a predetermined domain of a target nucleic acid. The polynucleotide: target nucleic acid complex is treated with a single-strand specific nuclease, which preferentially nicks the regions flanking the polynucleotide: target nucleic acid complex region (Ferrin and Camerini-Otero, 1991, Science 254: 1494-1497). That is, the domain is protected from recombination by the initial presence of the recombinase in the complex. The nuclease is inactivated and the complex dissociated. The nicked target nucleic acid is reassembled and recombined by PCR to produce a library of nucleic acids with preferential modifications in the nicked regions. The library of modified nucleic acids can be introduced into a host cell and expressed. Cells are selected and isolated that comprise a modified nucleic acid that encodes a polypeptide having a desired property. This process is repeated iteratively to further evolve the predetermined targeted domain of interest.

In each of the methods described above, single domains and optionally multiple domains are targeted. The methods and compositions described above are optionally used in combination for domain specific gene evolution. For example, individual or multiple rounds of domain specific DNA nicking are followed or interspersed with one or more rounds of domain specific evolution employing a plurality of targeting polynucleotides described above.

The methods of the present invention also avoid multiple subcloning steps. This is particularly relevant when large complex vectors such as lambda, BACs, PACs, YACs, MACs and other genomic DNAs are used, where multiple subcloning steps make mutagenesis and shuffling of unique sites in large vectors particularly tedious and time consuming.

3.5.3.3. Generating Homologous Recombination Intermediates In Vitro, Panels or Libraries of Mutagenized and Shuffled Genes to Generate In Vitro Evolution

Accordingly, the present invention provides methods to introduce recombinogenic probe or hybrid complexes into recombination proficient cells to link in vitro and in vivo recombination and evolution processes. By generating homologous recombination intermediates in vitro, panels or libraries of mutagenized and shuffled genes are generated for in vitro evolution. The link to in vivo systems allows in vivo selection of evolved genes encoding proteins of a desired characteristic. The present invention can thus be used in a variety of important ways.

3.5.3.3.1. Methods Can Be Used in the Creation of Transgenic Organisms, Animal, and Plant Models of Disease

First, these methods can be used in the creation of transgenic organisms, animal, and plant models of disease. Thus, for example, domain-specific targeting polynucleotides used in homologous recombination methods can generate animals that have a wide variety of mutations in a wide variety of functionally related genes, potentially resulting in a wide variety of phenotypes, including phenotypes related to disease states. This may also be done on a cellular level, to identify genes involved in cellular phenotypes, i.e. target identification.

3.5.3.3.2. Identify “Reversion” Genes, Genes That Can Modulate Disease States

Secondly, domain targeting can be used in cells or animals that are diseased or altered; in essence, domain targeting can be done to identify “reversion” genes, genes that can modulate disease states caused by different genes, either genes within the same gene family or a completely different gene family. Thus, for example, the loss of one type of enzymatic activity, resulting in a disease phenotype, may be compensated by alterations in a different but homologous enzymatic activity.

3.5.3.3.3. Creation of Libraries of Altered Nucleic Acids

In addition, the methods may be used in the creation of libraries of altered nucleic acids, including extrachromosomal sequences, which can be expressed in cells to produce libraries of altered proteins. These proteins can then be screened for any number of useful or interesting properties, including, but not limited to, increased or altered stability (to heat, pH, oxidants, proteases, etc.); altered specificity (for example, in the case of enzymes); altered binding; modified activity; and other desirable properties, such as altered immunogenicity.

3.5.3.4. Use of Homology Motif Tags (HMTs) in Targeted Homologous Recombination to Elucidate Disease Mechanisms and to Identify Disease Targets Contained within Gene Families

The present invention is directed to the use of homology motif tags (HMTs) in targeted homologous recombination to elucidate disease mechanisms and to identify disease targets contained within gene families related by the presence of one or more common domains. That is, there are a large number of gene families that contain genes related by the presence of similar functional domains, i.e. binding domains for substrates or other proteins, enzymatic domains such as kinase or protease domains, signaling and regulator domains, receptor binding domains, ATP binding domains, leucine zipper domains, zinc finger domains, etc. These functional domains frequently result in primary sequence homology; that is, related functional domains have related sequences. Many of these functional domains have been studied and so-called “consensus sequences” identified; that is, an average sequence derived from a number of related sequences. Each residue (or set of residues) of the consensus sequence is the most frequent at that position in the set under consideration. Consensus sequences can be either amino acid or nucleic acid consensus sequences, with amino acid sequences being used to generate nucleic acid consensus sequences.
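The definition of a consensus sequence given above (the most frequent residue at each position of a set of related sequences) can be illustrated with a minimal Python sketch. The function name, the gap symbol, the alphabetical tie-breaking rule and the three toy sequences are assumptions made for illustration only; they are not taken from the specification, and real consensus derivation would start from a proper multiple sequence alignment.

```python
from collections import Counter


def consensus(aligned, gap="-"):
    """Return the per-column most frequent residue of pre-aligned sequences.

    Assumptions (not from the specification): sequences are already aligned
    to equal length, gaps are ignored when counting, and ties are broken
    alphabetically so that the result is deterministic.
    """
    if not aligned:
        raise ValueError("at least one sequence is required")
    length = len(aligned[0])
    if any(len(seq) != length for seq in aligned):
        raise ValueError("all sequences must have the same aligned length")

    result = []
    for column in zip(*aligned):
        counts = Counter(res for res in column if res != gap)
        if not counts:
            # Column contains only gaps; keep the gap symbol.
            result.append(gap)
            continue
        top = max(counts.values())
        result.append(min(res for res, n in counts.items() if n == top))
    return "".join(result)


# Hypothetical toy alignment, for illustration only.
toy_family = ["GFRGEAL", "GFRGEGL", "GYRGEAL"]
print(consensus(toy_family))
```

Note that no member of the toy family needs to match the consensus exactly; the consensus is a summary of the set rather than a sequence drawn from it.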

Interestingly, while a wide variety of gene families are known, the majority of drug targets come from only four of these gene families. These are the G-protein coupled or seven-transmembrane domain receptors, nuclear (hormone) receptors, ion channels, and esterases. Other important gene families are enzymes, including recombinases. Of the top 100 pharmaceutical drugs, 18 bind to seven-transmembrane receptors, 10 to nuclear receptors and 16 to ion channels.

By using HMTs directed to the consensus sequences of gene families for homologous recombination and particularly enhanced homologous recombination methods, sequence modifications may be made to any number of targeted genes in a related family.

3.5.3.4.1. Methods and Compositions Utilizing Homology Motif Tags (HMTs) or Consensus Sequences

Accordingly, the present invention provides methods and compositions utilizing homology motif tags (HMTs) or consensus sequences. By “homology motif tag” or “protein consensus sequence” herein is meant an amino acid consensus sequence of a gene family. By “consensus nucleic acid sequence” herein is meant a nucleic acid that encodes a consensus protein sequence of a functional domain of a gene family. In addition, “consensus nucleic acid sequence” can also refer to cis sequences that are non-coding but can serve a regulatory or other role. As outlined below, generally a library of consensus nucleic acid sequences is used that comprises a set of degenerate nucleic acids encoding the protein consensus sequence. A wide variety of protein consensus sequences for a number of gene families are known. A “gene family” therefore is a set of genes that encode proteins that contain a functional domain for which a consensus sequence can be identified. However, in some instances, a gene family includes non-coding sequences; for example, consensus regulatory regions can be identified. For example, gene family/consensus sequence pairs are known for the G-protein coupled receptor family, the AAA-protein family, the bZIP transcription factor family, the mutS family, the recA family, the Rad51 family, the dmel family, the recF family, the SH2 domain family, the Bcl-2 family, the single-stranded binding protein family, the TFIID transcription family, the TGF-beta family, the TNF family, the XPA family, the XPG family, actin binding proteins, bromodomain GDP exchange factors, MCM family, ser/thr phosphatase family, etc.

As will be appreciated by those in the art, the proteins of the gene families generally do not contain the exact consensus sequences; generally consensus sequences are artificial sequences that represent the best comparison of a variety of sequences. The actual sequence that corresponds to the functional sequence within a particular protein is termed a “consensus functional domain” herein; that is, a consensus functional domain is the actual sequence within a protein that corresponds to the consensus sequence. A consensus functional domain may also be a “predetermined endogenous DNA sequence” (also referred to herein as a “predetermined target sequence”) that is a polynucleotide sequence contained in a target cell. Such sequences can include, for example, chromosomal sequences (e.g., structural genes, regulatory sequences including promoters and enhancers, recombinatorial hotspots, repeat sequences, integrated proviral sequences, hairpins, palindromes), episomal or extrachromosomal sequences (e.g., replicable plasmids or viral replication intermediates) including chloroplast and mitochondrial DNA sequences. By “predetermined” or “pre-selected” it is meant that the consensus functional domain target sequence may be selected at the discretion of the practitioner on the basis of known or predicted sequence information, and is not constrained to specific sites recognized by certain site-specific recombinases (e.g., FLP recombinase or CRE recombinase). In some embodiments, the predetermined endogenous DNA target sequence will be other than a naturally occurring germline DNA sequence (e.g., a transgene, parasitic, mycoplasmal or viral sequence).

3.5.3.4.1.1. Gene Family is the G-Protein Coupled Receptor Family

3.5.3.4.1.1.1. Subfamily 1 Also Called R7G Proteins

In a preferred embodiment, the gene family is the G-protein coupled receptor family, which has over 900 identified members, including several subfamilies. In a preferred embodiment, the G-protein coupled receptors are from subfamily I and are also called R7G proteins. They are an extensive group of receptors which recognize hormones, neurotransmitters, odorants and light and transduce extracellular signals by interaction with guanine (G) nucleotide-binding proteins. The structure of all these receptors is thought to be virtually identical, and they contain seven hydrophobic regions, each of which putatively spans the membrane. The N-terminus is extracellular and is frequently glycosylated, and the C-terminus is cytoplasmic and generally phosphorylated. Three extracellular loops alternate with three cytoplasmic loops to link the seven transmembrane regions.

G-protein coupled receptors include, but are not limited to: the class A rhodopsin first subfamily, including amine (acetylcholine (muscarinic), adrenoceptors, dopamine, histamine, serotonin, octopamine), peptides (angiotensin, bombesin, bradykinin, C5a anaphylatoxin, fMet-Leu-Phe, interleukin-8, chemokine, CCK, endothelin, melanocortin, neuropeptide Y, neurotensin, opioid, somatostatin, tachykinin, thrombin, vasopressin-like, galanin, proteinase activated), hormone proteins (follicle stimulating hormone, lutropin-choriogonadotropic hormone, thyrotropin), rhodopsin (vertebrate), olfactory (olfactory type I-II, gustatory), prostanoid (prostaglandin, prostacyclin, thromboxane), nucleotide (adenosine, purinoceptors), cannabis, platelet activating factor, gonadotropin-releasing hormone (gonadotropin releasing hormone, thyrotropin-releasing hormone, growth hormone secretagogue), melatonin, viral proteins, MHC receptor, Mas proto-oncogene, EBV-induced and glucocorticoid induced; the class B secretin second subfamily, including calcitonin, corticotropin releasing factor, gastric inhibitory peptide, glucagon, growth hormone releasing hormone, parathyroid hormone, secretin, vasoactive intestinal polypeptide, and diuretic hormone; the class C metabotropic glutamate third subfamily, including metabotropic glutamate and extracellular calcium-sensing agents; and the class D pheromone fourth subfamily. Because of the large number of family members, these large classes of GPCRs can be further subdivided into subfamilies. Examples of these subfamilies are calcitonin, glucagon, vasoactive and parathyroid from class B; and acetylcholine, histamine, angiotensin, alpha2- and beta-adrenergic from class A.

From each subfamily small protein consensus sequences can be derived from sequence alignments. For example, there are 6 motifs for the metabotropic glutamate like GPCRs derived from the indicated number of family members. Using the protein consensus sequence, degenerate nucleic acid probes are made to encode the protein consensus sequence, as is well known in the art. The protein sequence is encoded by DNA triplets which are deduced using standard tables. In some cases additional degeneracy is used to enable production in one oligonucleotide synthesis. In many cases motifs were chosen to minimize degeneracy. Amplification of neighboring sequences can utilize two motifs, as indicated, by faithful or error prone amplification. Alternatively, outside sequences can be used, as is indicated, using vector sequence. In addition, degenerate oligos can be synthesized and used directly in the procedure without amplification. Double stranded (ds) DNA probes are denatured and coated with RecA or another recombinase such as Rad51. This material can be used to bind to and allow capture of specific clones from cDNA or genomic libraries. Alternatively, this material can be introduced into cells, producing transgenic cells or animals with alterations in related family members.
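The back-translation of a protein consensus sequence into a degenerate probe, using standard codon tables, can be sketched in Python as follows. This is an illustrative sketch under stated assumptions and not the procedure of the specification: the function names are hypothetical, the standard genetic code is assumed, and each amino acid is collapsed into a single IUPAC-coded triplet by taking the union of bases at each codon position. For amino acids with split codon families (for example arginine, leucine and serine) that single triplet also covers some codons for other amino acids, which is one way to read the "additional degeneracy" accepted so that the probe can be made in one oligonucleotide synthesis.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}

# IUPAC nucleotide ambiguity codes keyed by the set of bases they represent.
IUPAC = {
    frozenset("A"): "A", frozenset("C"): "C",
    frozenset("G"): "G", frozenset("T"): "T",
    frozenset("AG"): "R", frozenset("CT"): "Y", frozenset("CG"): "S",
    frozenset("AT"): "W", frozenset("GT"): "K", frozenset("AC"): "M",
    frozenset("CGT"): "B", frozenset("AGT"): "D",
    frozenset("ACT"): "H", frozenset("ACG"): "V",
    frozenset("ACGT"): "N",
}


def degenerate_codon(amino_acid):
    """Return (IUPAC triplet, number of sequences it expands to)."""
    codons = [c for c, aa in CODON_TABLE.items() if aa == amino_acid]
    if not codons:
        raise ValueError(f"unknown amino acid: {amino_acid!r}")
    symbols, size = [], 1
    for pos in range(3):
        bases = frozenset(codon[pos] for codon in codons)
        symbols.append(IUPAC[bases])
        size *= len(bases)
    return "".join(symbols), size


def degenerate_oligo(peptide):
    """Back-translate a peptide into one degenerate oligo and its degeneracy."""
    triplets, degeneracy = [], 1
    for amino_acid in peptide:
        triplet, size = degenerate_codon(amino_acid)
        triplets.append(triplet)
        degeneracy *= size
    return "".join(triplets), degeneracy


oligo, degeneracy = degenerate_oligo("GFRGEAL")
print(oligo, degeneracy)
```

Under these assumptions, motifs rich in tryptophan or methionine (one codon each) give low degeneracy, whereas motifs rich in leucine, arginine or serine give high degeneracy, which is consistent with choosing motifs to minimize degeneracy.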

3.5.3.4.1.1.2. Second Subfamily Encoding Receptors that Bind Peptide Hormones that do not Show Sequence Similarity to the First R7G Subfamily

In addition to the first subfamily of G-protein coupled receptors, there is a second subfamily encoding receptors that bind peptide hormones that do not show sequence similarity to the first R7G subfamily. All the characterized receptors in this subfamily are coupled to G-proteins that activate both adenylyl cyclase and the phosphatidylinositol-calcium pathway. However, they are structurally similar; like classical R7G proteins they putatively contain seven transmembrane regions, a glycosylated extracellular N-terminus and a cytoplasmic C-terminus. Known receptors in this subfamily are encoded on multiple exons, and several of these genes are alternatively spliced to yield functionally distinct products. The N-terminus contains five conserved cysteine residues putatively important in disulfide bonds. Known G-protein coupled receptors in this subfamily are listed above.

3.5.3.4.1.1.3. Third Subfamily Encoding Receptors that Bind Glutamate and Calcium but do not Show Sequence Similarity to Either of the Other Subfamilies

In addition to the first and second subfamilies of G-protein coupled receptors, there is a third subfamily encoding receptors that bind glutamate and calcium but do not show sequence similarity to either of the other subfamilies. Structurally, this subfamily has signal sequences, very large hydrophobic extracellular regions of about 540 to 600 amino acids that contain 17 conserved cysteines (putatively involved in disulfides), a region of about 250 residues that appears to contain seven transmembrane domains, and a C-terminal cytoplasmic domain of variable length (50 to 350 residues). Known G-protein coupled receptors of this subfamily are listed above.

3.5.3.4.1.2. Gene Family is the bZIP Transcription Factor Family

In a preferred embodiment, the gene family is the bZIP transcription factor family. This eukaryotic gene family encodes DNA binding transcription factors that contain a basic region that mediates sequence specific DNA binding, and a leucine zipper, required for dimerization. The bZIP family includes, but is not limited to, AP-1, ATF, CREB, CREM, FOS, FRA, GBF, GCN4, HBP, JUN, MET4, OCS1, OP, TAFI, XBP1, and YBBO.

In a preferred embodiment, the gene family is involved in DNA mismatch repair, such as mutL, hexB and PMS1. Members of this family include, but are not limited to, MLH1, PMS1, PMS2, HexB and MutL. The protein consensus sequence is G-F-R-G-E-A-L.

In a preferred embodiment, the gene family is the mutS family, also involved in mismatch repair of DNA, directed to the correction of mismatched base pairs that have been missed by the proofreading element of the DNA polymerase complex. MutS gene family members include, but are not limited to, MSH2, MSH3, MSH6 and MutS.

In a preferred embodiment, the gene family is the recA family. The bacterial recA is essential for homologous recombination and recombinatorial repair of DNA damage. RecA has many activities, including the formation of nucleoprotein filaments, binding to single stranded and double stranded DNA, binding and hydrolyzing ATP, recombinase activity and interaction with lexA causing lexA activation and autocatalytic cleavage. RecA family members include those from E. coli, Drosophila, human, lily, etc., specifically including but not limited to, E. coli recA, Rec1, Rec2, Rad51, Rad51B, Rad51C, Rad51D, Rad51E, XRCC2 and DMC1.

3.5.3.4.1.3. Gene Family is the RecF Family

In a preferred embodiment, the gene family is the recF family. The prokaryotic recF protein is a single-stranded DNA binding protein which also putatively binds ATP. RecF is involved in DNA metabolism; it is required for recombinatorial DNA repair and for induction of the SOS response. RecF is a protein of about 350 to 370 amino acid residues; there is a conserved ATP-binding site motif “A” in the N-terminal section of the protein as well as two other conserved regions, one located in the central section and the other in the C-terminal section.

3.5.3.4.1.4. Gene Family is the Bcl-2 Family

In a preferred embodiment, the gene family is the Bcl-2 family. Programmed cell death (PCD), or apoptosis, is induced by events such as growth factor withdrawal and toxins. It is generally controlled by regulators, which have either an inhibitory effect (i.e. anti-apoptotic) or block the protective effect of inhibitors (pro-apoptotic). Many viruses have found a way of countering defensive apoptosis by encoding their own anti-apoptotic genes, thereby preventing their target cells from dying too soon.

All proteins belonging to the Bcl-2 family contain at least one of a BH1, BH2, BH3 or BH4 domain. All anti-apoptotic proteins contain BH1 and BH2 domains; some of them contain an additional N-terminal BH4 domain (such as Bcl-2, Bcl-x(L), Bcl-W, etc.), which is generally not found in pro-apoptotic proteins (with the exception of Bcl-x(S)). Generally all pro-apoptotic proteins contain a BH3 domain (except for Bad), thought to be crucial for the dimerization of the proteins with other Bcl-2 family members and crucial for their killing activity. In addition, some of the pro-apoptotic proteins contain BH1 and BH2 domains (such as Bax and Bak). The BH3 domain is also present in some anti-apoptosis proteins, such as Bcl-2 and Bcl-x(L). Known Bcl-2 proteins include, but are not limited to, Bcl-2, Bcl-x(L), Bcl-W, Bcl-x(S), Bad, Bax, and Bak.

3.5.3.4.1.5. Gene Family is the Site-Specific Recombinase Family

In a preferred embodiment, the gene family is the site-specific recombinase family. Site-specific recombination plays an important role in DNA rearrangement in a) recombination between inverted repeats resulting in the reversal of a DNA segment; and b) recombination between repeat sequences on two DNA molecules resulting in their cointegration, or between repeats on one DNA molecule resulting in the excision of a DNA fragment. Site-specific recombination is characterized by a strand exchange mechanism that requires no DNA synthesis or high energy cofactor; the phosphodiester bond energy is conserved in a phospho-protein linkage during strand cleavage and re-ligation.

Two unrelated families of recombinases are currently known. The first, called the “phage integrase” family, groups a number of bacterial, phage and yeast plasmid enzymes. The second, called the “resolvase” family, groups enzymes which share the following structural characteristics: an N-terminal catalytic and dimerization domain that contains a conserved serine residue involved in the transient covalent attachment to DNA, and a C-terminal helix-turn-helix DNA-binding domain.

3.5.3.4.1.6. Gene Family is the Single-Stranded Binding Protein Family

In a preferred embodiment, the gene family is the single-stranded binding protein family. The E. coli single-stranded binding protein (ssb), also known as the helix-destabilizing protein, is a protein of 177 amino acids. It binds tightly as a homotetramer to single-stranded DNA (ss-DNA) and plays an important role in DNA replication, recombination and repair. Members of the ssb family include, but are not limited to, E. coli ssb and eukaryotic RPA proteins.

3.5.3.4.1.7. Gene Family Is The TFIID Transcription Family

In a preferred embodiment, the gene family is the TFIID transcription family. Transcription factor TFIID (or TATA-binding protein, TBP) is a general factor that plays a major role in the activation of eukaryotic genes transcribed by RNA polymerase II. TFIID binds specifically to the TATA box promoter element which lies close to the position of transcription initiation. There is a remarkable degree of sequence conservation of a C-terminal domain of about 180 residues in TFIID from various eukaryotic sources. This region is necessary and sufficient for TATA box binding. The most significant structural feature of this domain is the presence of two conserved repeats of a 77 amino-acid region.

3.5.3.4.1.8. Gene Family is the TGF-beta Family

In a preferred embodiment, the gene family is the TGF-beta family. Transforming growth factor-beta (TGF-beta) is a multifunctional protein that controls proliferation, differentiation and other functions in many cell types. TGF-beta-1 is a protein of 112 amino acid residues derived by proteolytic cleavage from the C-terminal portion of the precursor protein. Members of the TGF-beta family include, but are not limited to, the TGF-1-3 subfamily (including TGF1, TGF2, and TGF3); the BMP3 subfamily (BM3B, BMP3); the BMP5-8 subfamily (BM8A, BMP5, BMP6, BMP7, and BMP8); and the BMP 2 & 4 subfamily (BMP2, BMP4, DECA).

In a preferred embodiment, the gene family is the TNF family. A number of cytokines can be grouped into a family on the basis of amino acid sequence, as well as structural and functional similarities. These include (1) tumor necrosis factor (TNF), also known as cachectin or TNF-alpha, which is a cytokine with a wide variety of functions. TNF-alpha can cause cytolysis of certain tumor cell lines; it is involved in the induction of cachexia; it is a potent pyrogen, causing fever by direct action or by stimulation of interleukin-1 secretion; and it can stimulate cell proliferation and induce cell differentiation under certain conditions; (2) lymphotoxin-alpha (LT-alpha) and lymphotoxin-beta (LT-beta), two related cytokines which are produced by lymphocytes and are cytotoxic for a wide range of tumor cells in vitro and in vivo; (3) T cell antigen gp39 (CD40L), a cytokine that seems to be important in B-cell development and activation; (4) CD27L, a cytokine that plays a role in T-cell activation; it induces the proliferation of costimulated T cells and enhances the generation of cytolytic T cells; (5) CD30L, a cytokine that induces proliferation of T-cells; (6) FASL, a cytokine involved in cell death; (8) 4-1BBL, an inducible T cell surface molecule that contributes to T-cell stimulation; (9) OX40L, a cytokine that co-stimulates T cell proliferation and cytokine production; and (10) TNF-related apoptosis inducing ligand (TRAIL), a cytokine that induces apoptosis.

3.5.3.4.1.9. Gene Family is the XPA Family

In a preferred embodiment, the gene family is the XPA family. Xeroderma pigmentosum (XP) is a human autosomal recessive disease, characterized by a high incidence of sunlight-induced skin cancer. Skin cells associated with this condition are hypersensitive to ultraviolet light, due to defects in the incision step of DNA excision repair. There are a minimum of 7 genetic complementation groups involved in this disorder: XPA to XPG. XPA is the most common form of the disease and is due to defects in a 30 kD nuclear protein called XPA (or XPAC). The sequence of XPA is conserved from higher eukaryotes to yeast (gene RAD14). XPA is a hydrophilic protein of 247 to 296 amino acid residues that has a C4-type zinc finger motif in its central section.

3.5.3.4.1.10. Gene Family is the XPG Family

In a preferred embodiment, the gene family is the XPG family. The defect in XPG can be corrected by a 133 kD nuclear protein called XPG (or XPGC). Members of the XPG family include, but are not limited to, FEN1, XPG, RAD2, EXO1, and DIN7.

Once having identified a gene family and a consensus sequence, the compositions of the invention can be made. The compositions of the invention comprise at least one recombinase and at least two single-stranded targeting polynucleotides which are substantially complementary to each other and each have a consensus homology clamp for a gene family.

3.5.3.5. Homologous Recombination

Accordingly, the present invention provides methods of homologous recombination. By “homologous recombination” (HR) herein is meant an exchange of homologous or similar DNA sequence between two DNA molecules. An essential feature of HR is that the enzyme responsible for the recombination event can pair any homologous sequences as substrates. The ability of HR to transfer genetic information between DNA molecules makes targeted homologous recombination a very powerful method in genetic engineering and gene manipulation. HR can be used to insert, delete, and/or substitute any one or more nucleotides in a gene or gene segment or to introduce or delete genes in a targeted nucleic acid.

Once having identified a protein domain, the compositions of the invention can be made. The compositions of the invention comprise at least one recombinase and at least two single-stranded targeting polynucleotides which are substantially complementary to each other and each have a domain homology clamp.

3.5.3.6. Recombinase

By “recombinase” herein is meant a protein or peptide (e.g. L2 peptide) that, when included with an exogenous targeting polynucleotide, provides a measurable increase in the recombination frequency and/or localization frequency between the targeting polynucleotide and an endogenous predetermined DNA sequence. Thus, in a preferred embodiment, increases in recombination frequency from the normal range of 10⁻⁸ to 10⁻⁴, to 10⁻⁴ to 10¹, preferably 10⁻³ to 10¹, and most preferably 10⁻² to 10⁰, may be achieved.

In the present invention, recombinase refers to a family of RecA-like and Rad51-like recombination proteins all having essentially all or most of the same functions, particularly: (i) the recombinase protein's ability to properly bind to and position targeting polynucleotides on their homologous targets and (ii) the ability of recombinase protein/targeting polynucleotide complexes to efficiently find and bind to complementary endogenous sequences. The best characterized RecA protein is from E. coli; in addition to the wild-type protein, a number of mutant RecA proteins have been identified (e.g., RecA803; see Madiraju et al., PNAS USA 85(18): 6592 (1988); Madiraju et al., Biochem. 31: 10529 (1992); Lavery et al., J. Biol. Chem. 267: 20648 (1992)). Further, many organisms have RecA-like recombinases with strand-transfer activities (e.g., Fugisawa et al., (1985) Nucl. Acids Res. 13: 7473; Hsieh et al., (1986) Cell 44: 885; Hsieh et al., (1989) J. Biol. Chem. 264: 5089; Fishel et al., (1988) Proc. Natl. Acad. Sci. (USA) 85: 3683; Cassuto et al., (1987) Mol. Gen. Genet. 208: 10; Ganea et al., (1987) Mol. Cell Biol. 7: 3124; Moore et al., (1990) J. Biol. Chem. 19: 11108; Keene et al., (1984) Nucl. Acids Res. 12: 3057; Kmiec, (1984) Cold Spring Harbor Symp. 48: 675; Kmiec, (1986) Cell 44: 545; Kolodner et al., (1987) Proc. Natl. Acad. Sci. USA 84: 5560; Sugino et al., (1985) Proc. Natl. Acad. Sci. USA 85: 3683; Halbrook et al., (1989) J. Biol. Chem. 264: 21403; Eisen et al., (1988) Proc. Natl. Acad. Sci. USA 85: 7481; McCarthy et al., (1988) Proc. Natl. Acad. Sci. USA 85: 5854; Lowenhaupt et al., (1989) J. Biol. Chem. 264: 20568), which are incorporated herein by reference.

Examples of such recombinase proteins include, but are not limited to: RecA, RecA803, uvsX, and other RecA mutants and RecA-like recombinases (Roca, A. I. (1990) Crit. Rev. Biochem. Molec. Biol. 25: 415), sep1 (Kolodner et al. (1987) Proc. Natl. Acad. Sci. (U.S.A.) 84: 5560; Tishkoff et al. Molec. Cell. Biol. 11: 2593), RuvC (Dunderdale et al. (1991) Nature 354: 506), DST2, KEM1, XRN1 (Dykstra et al. (1991) Molec. Cell. Biol. 11: 2583), STPalpha/DST1 (Clark et al. (1991) Molec. Cell. Biol. 11: 2576), HPP-1 (Moore et al. (1991) Proc. Natl. Acad. Sci. (U.S.A.) 88: 9067), other target recombinases (Bishop et al. (1992) Cell 69: 439; Shinohara et al. (1992) Cell 69: 457); incorporated herein by reference. RecA may be purified from E. coli strains, such as E. coli strains JC12772 and JC15369 (available from A. J. Clark and M. Madiraju, University of California-Berkeley, or purchased commercially). These strains contain the RecA coding sequences on a “runaway” replicating plasmid vector present at a high copy number per cell. The RecA803 protein is a high-activity mutant of wild-type RecA.

The art teaches several examples of recombinase proteins, for example, from Drosophila, yeast, plant, human, and non-human mammalian cells, including proteins with biological properties similar to RecA (i.e., RecA-like recombinases), such as Rad51, Rad55, Rad57, and Dmc1 from mammals and yeast. In addition, the recombinase may actually be a complex of proteins, i.e. a “recombinosome”. In addition, included within the definition of a recombinase are portions or fragments of recombinases which retain recombinase biological activity, as well as variants or mutants of wild-type recombinases which retain biological activity, such as the E. coli RecA803 mutant with enhanced recombinase activity.

3.5.3.6.1. RecA or rad51

In a preferred embodiment, RecA or rad51 is used. For example, RecA protein is typically obtained from bacterial strains that overproduce the protein: wild-type E. coli RecA protein and mutant RecA803 protein may be purified from such strains. Alternatively, RecA protein can also be purchased from, for example, Pharmacia (Piscataway, N.J.) or Boehringer Mannheim (Indianapolis, Ind.). RecA protein and its homologs form a nucleoprotein filament when they coat single-stranded DNA. In this nucleoprotein filament, one monomer of RecA protein is bound to about 3 nucleotides. This property of RecA to coat single-stranded DNA is essentially sequence independent, although particular sequences favor initial loading of RecA onto a polynucleotide (e.g., nucleation sequences). The nucleoprotein filament(s) can be formed on essentially any DNA molecule and can be formed in cells (e.g., mammalian cells), forming complexes with both single-stranded and double-stranded DNA, although the loading conditions for dsDNA are somewhat different than for ssDNA.
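By way of illustration only, the stated stoichiometry of about one RecA monomer per 3 nucleotides can be turned into a rough estimate of how many monomers are needed to coat a single-stranded targeting polynucleotide. The sketch below is not part of the described method; the function name is hypothetical, and the ratio of "about 3" is treated as exactly 3 for simplicity.

```python
import math

# "About 3" nucleotides per RecA monomer, per the text; treated as exact here (assumption).
NUCLEOTIDES_PER_RECA_MONOMER = 3


def reca_monomers_to_coat(ssdna_length_nt: int) -> int:
    """Estimate the number of RecA monomers needed to fully coat an ssDNA molecule."""
    if ssdna_length_nt < 0:
        raise ValueError("length must be non-negative")
    # Round up so that a partial binding site at the end is still covered.
    return math.ceil(ssdna_length_nt / NUCLEOTIDES_PER_RECA_MONOMER)
```

Under this assumption, a 600-nucleotide single strand would call for roughly 200 monomers; actual loading depends on the incubation conditions.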

3.5.3.6.1.1. The Recombinase is Combined with Targeting Polynucleotides

The recombinase is combined with targeting polynucleotides as is more fully outlined below. By “nucleic acid” or “oligonucleotide” or “polynucleotide” or grammatical equivalents herein is meant at least two nucleotides covalently linked together. A nucleic acid of the present invention will generally contain phosphodiester bonds, although in some cases nucleic acid analogs are included that may have alternate backbones, comprising, for example, phosphoramide (Beaucage et al., Tetrahedron 49(10): 1925 (1993) and references therein; Letsinger, J. Org. Chem. 35: 3800 (1970); Sprinzl et al., Eur. J. Biochem. 81: 579 (1977); Letsinger et al., Nucl. Acids Res. 14: 3487 (1986); Sawai et al., Chem. Lett. 805 (1984); Letsinger et al., J. Am. Chem. Soc. 110: 4470 (1988); and Pauwels et al., Chemica Scripta 26: 141 (1986)), phosphorothioate, phosphorodithioate, O-methylphosphoroamidite linkages (see Eckstein, Oligonucleotides and Analogues: A Practical Approach, Oxford University Press), and peptide nucleic acid backbones and linkages (see Egholm, J. Am. Chem. Soc. 114: 1895 (1992); Meier et al., Chem. Int. Ed. Engl. 31: 1008 (1992); Nielsen, Nature 365: 566 (1993); Carlsson et al., Nature 380: 207 (1996), all of which are incorporated by reference). These modifications of the ribose-phosphate backbone or bases may be done to facilitate the addition of other moieties such as chemical constituents, including 2′ O-methyl and 5′ modified substituents, as discussed below, or to increase the stability and half-life of such molecules in physiological environments.

The nucleic acids may be single stranded or double stranded, as specified, or contain portions of both double stranded and single stranded sequence. The nucleic acid may be DNA, both genomic and cDNA, RNA or a hybrid, where the nucleic acid contains any combination of deoxyribo- and ribo-nucleotides, and any combination of bases, including uracil, adenine, thymine, cytosine, guanine, inosine, xanthine and hypoxanthine, etc. Thus, for example, chimeric DNA-RNA molecules may be used such as described in Cole-Strauss et al., Science 273: 1386 (1996) and Yoon et al., PNAS USA 93: 2071 (1996), both of which are hereby incorporated by reference.

In general, the targeting polynucleotides may comprise any number of structures, as long as the changes do not substantially affect the functional ability of the targeting polynucleotide to result in homologous recombination. For example, recombinase coating of alternate structures should still be able to occur.

By “targeting polynucleotides” herein is meant the polynucleotides used to make alterations in the protein domains as described herein. Targeting polynucleotides are generally ssDNA or dsDNA, most preferably two complementary single stranded DNAs.

Targeting polynucleotides are generally at least about 5 to 2000 nucleotides long, preferably about 12 to 200 nucleotides long, at least about 200 to 500 nucleotides long, more preferably at least about 500 to 2000 nucleotides long, or longer; however, as the length of a targeting polynucleotide increases beyond about 20,000 to 50,000 to 400,000 nucleotides, the efficiency of transferring an intact targeting polynucleotide into the cell decreases. The length of homology may be selected at the discretion of the practitioner on the basis of the sequence composition and complexity of the predetermined endogenous target DNA sequence(s) and guidance provided in the art, which generally indicates that 1.3 to 6.8 kilobase segments of homology are preferred when non-recombinase mediated methods are utilized (Hasty et al. (1991) Molec. Cell. Biol. 11: 5586; Shulman et al. (1990) Molec. Cell. Biol. 10: 4466, which are incorporated herein by reference).

Targeting polynucleotides have at least one sequence that substantially corresponds to, or is substantially complementary to, a predetermined endogenous DNA sequence. As used herein, the terms “predetermined target nucleic acid” and “predetermined target sequence” and “predetermined domain of a target nucleic acid” refer to polynucleotide sequences contained in a target nucleic acid. Such sequences include, for example, chromosomal sequences (e.g., structural genes, regulatory sequences including promoters and enhancers, recombinatorial hotspots, repeat sequences, integrated proviral sequences, hairpins, palindromes), episomal or extrachromosomal sequences (e.g., replicable plasmids or viral replication intermediates) including chloroplast and mitochondrial DNA sequences. By “predetermined” or “pre-selected” it is meant that the target sequence may be selected at the discretion of the practitioner on the basis of known or predicted sequence information, and is not constrained to specific sites recognized by certain site-specific recombinases (e.g., FLP recombinase or CRE recombinase). In some embodiments, the predetermined endogenous DNA target sequence will be other than a naturally occurring germline DNA sequence (e.g., a transgene, parasitic, mycoplasmal or viral sequence). An exogenous polynucleotide is a polynucleotide which is transferred into a target cell but which has not been replicated in that host cell; for example, a virus genome polynucleotide that enters a cell by fusion of a virion to the cell is an exogenous polynucleotide; however, replicated copies of the viral polynucleotide subsequently made in the infected cell are endogenous sequences (and may, for example, become integrated into a cell chromosome). Similarly, transgenes which are microinjected or transfected into a cell are exogenous polynucleotides; however, integrated and replicated copies of the transgene(s) are endogenous sequences.

3.5.3.6.1.1.1. Target Nucleic Acid Comprises a Nucleotide Sequence Encoding a Protein or Polypeptide or can be Made to Comprise Non-Coding Regions as Well

In a preferred embodiment, the target nucleic acid comprises a nucleotide sequence encoding a protein or polypeptide, although as outlined herein, target nucleic acids may be made to non-coding regions as well. By “protein” herein is meant at least two covalently attached amino acids, which includes proteins, polypeptides, oligopeptides and peptides. Thus “amino acid” or “peptide residue”, as used herein, means naturally occurring and naturally modified amino acids. For example, “amino acid” also includes imino acid residues such as proline and hydroxyproline. A “naturally modified amino acid” includes, for example, amino acids that are modified to contain carbohydrate structures, such as high-mannose or complex carbohydrates, phosphate, or lipids. In the preferred embodiment, the amino acids are in the (S) or L-configuration.

The nucleotide sequence encoding the polypeptide is preferably operably linked to transcription and translation control elements operable in a host cell of interest, such that introduction of the target nucleic acid results in expression of the encoded protein. The transcription control elements include a promoter, such as a constitutive or inducible promoter. When the host cell of interest is a eukaryotic cell, enhancer elements are optionally employed. In a preferred embodiment the target nucleic acid is an extrachromosomal vector such as a plasmid. In other embodiments, the target nucleic acid is a viral vector, such as a retrovirus, a phage, a BAC, PAC, YAC, MAC or other types of genomic and chromosomal DNA.

The term “naturally-occurring” as used herein as applied to an object refers to the fact that an object can be found in nature. For example, a polynucleotide sequence that is present in an organism (including viruses) that can be isolated from a source in nature and which has not been intentionally modified by man in the laboratory is naturally-occurring.

3.5.3.6.1.1.2. The Target Nucleic Acid Comprises a Nucleic Acid Encoding a Protein Domain

The methods of the invention are used for alteration and evolution of protein domains; that is, in a preferred embodiment, the target nucleic acid comprises a nucleic acid encoding a protein domain. By “protein domain” and grammatical equivalents as used herein is meant a region of a protein that provides a specific structural and/or functional characteristic. Accordingly, a protein domain is an enzymatic active site, a ligand binding site, an allosteric effector region, an epitope, or a region of a protein that is modified, such as by addition of a carbohydrate, phosphate or lipid. A domain also relates to the hydrophobicity or hydrophilicity of a region and, therefore, also includes extracellular, intracellular, and transmembrane domains. Cell targeting sequences, such as a signal peptide, nuclear localization sequence, mitochondrial localization sequences, etc., that direct proteins to either an extracellular or subcellular locale are domains. Additional domains include regions of proteins that interact with other proteins or nucleic acids, for example, multimerization sequences, zinc-finger motifs, and the like. In another aspect, a protein domain is a region encoded by an exon.

Targeting polynucleotides have at least one sequence that substantially corresponds to, or is substantially complementary to, a target nucleic acid; in a preferred embodiment, it corresponds to or complements a nucleic acid encoding a protein domain. By “corresponds to” herein is meant that a polynucleotide sequence is homologous (i.e., may be similar or identical, not strictly evolutionarily related) to all or a portion of a reference polynucleotide sequence, or that a polypeptide sequence is identical to a reference polypeptide sequence. In contradistinction, the term “complementary to” is used herein to mean that the complementary sequence can hybridize to all or a portion of a reference polynucleotide sequence. Thus, one of the complementary single stranded targeting polynucleotides is complementary to one strand of the endogenous target domain sequence (i.e. Watson) and corresponds to the other strand of the endogenous target domain sequence (i.e. Crick). Thus, the complementarity between two single-stranded targeting polynucleotides need not be perfect. For illustration, the nucleotide sequence “TATAC” corresponds to a reference sequence “TATAC” and is perfectly complementary to a reference sequence “GTATA”.
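The “TATAC”/“GTATA” illustration can be checked with a short sketch. It is illustrative only: the function names are hypothetical, and both tests are deliberately strict (exact identity and perfect Watson-Crick complementarity to a portion of the reference), whereas the definitions above also admit imperfect homology.

```python
# Watson-Crick pairing partners for the four DNA bases.
_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq: str) -> str:
    """Return the reverse complement of a DNA sequence written 5' to 3'."""
    return "".join(_COMPLEMENT[base] for base in reversed(seq.upper()))


def corresponds_to(query: str, reference: str) -> bool:
    """Strict case: the query is identical to all or a portion of the reference."""
    return query.upper() in reference.upper()


def is_complementary_to(query: str, reference: str) -> bool:
    """Strict case: the query pairs perfectly with all or a portion of the reference."""
    return reverse_complement(query) in reference.upper()
```

With these definitions, `corresponds_to("TATAC", "TATAC")` and `is_complementary_to("TATAC", "GTATA")` both hold, matching the illustration in the text.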

The terms “substantially corresponds to” or “substantial identity” or “homologous” as used herein denote a characteristic of a nucleic acid sequence, wherein a nucleic acid sequence has at least about 50 percent sequence identity as compared to a reference sequence, typically at least about 70 percent sequence identity, and preferably at least about 85 percent sequence identity as compared to a reference sequence. The percentage of sequence identity is calculated excluding small deletions or additions which total less than 25 percent of the reference sequence. The reference sequence may be a subset of a larger sequence, such as a portion of a gene or flanking sequence, or a repetitive portion of a chromosome. However, the reference sequence is at least 18 nucleotides long, typically at least about 30 nucleotides long, and preferably at least about 50 to 100 nucleotides long.

“Substantially complementary” as used herein refers to a sequence that is complementary to a sequence that substantially corresponds to a reference sequence. In general, targeting efficiency increases with the length of the targeting polynucleotide portion that is substantially complementary to a reference sequence present in the target DNA.

By “sequence homology” herein is meant sequence similarity or sequence identity. Nucleic acid similarity can be determined using, for example, BLASTN (Altschul et al. 1990. J. Mol. Biol. 215: 403-410). BLASTN uses a simple scoring system in which matches count +5 and mismatches −4. To achieve computational efficiency, the default parameters have been incorporated directly into the source code.
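The +5/−4 scoring scheme can be illustrated with a minimal sketch that scores two equal-length sequences position by position. This is not the BLASTN algorithm itself, which also performs word seeding, extension and gap handling; the function name is hypothetical and gaps are not modelled.

```python
MATCH_SCORE = 5      # score for an identical base pair, per the text
MISMATCH_SCORE = -4  # score for a mismatched base pair, per the text


def ungapped_score(seq_a: str, seq_b: str) -> int:
    """Sum match/mismatch scores over two equal-length, ungapped sequences."""
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must have equal length for ungapped scoring")
    return sum(
        MATCH_SCORE if a == b else MISMATCH_SCORE
        for a, b in zip(seq_a.upper(), seq_b.upper())
    )
```

For example, two identical 4-mers score 20, and a single mismatch lowers the score to 11.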

3.5.3.7. Percent Nucleic Acid Sequence Identity is Determined

In an alternative embodiment, percent nucleic acid sequence identity is determined. In percent identity calculations relative weight is not assigned to the various types of sequence variation, such as insertions, deletions, substitutions, etc. Only identities are scored positively (+1) and all forms of sequence variation are given a value of “0”, which obviates the need for a weighted scale or parameters as described above for sequence similarity calculations. Percent sequence identity can be calculated, for example, by dividing the number of matching identical residues by the total number of residues of the “shorter” sequence in the aligned region and multiplying by 100. The “longer” sequence is the one having the most actual residues in the aligned region.
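A minimal sketch of this percent identity calculation follows. It assumes the two sequences have already been aligned by some other means and are supplied as equal-length strings in which “-” marks a gap; that input convention and the function name are assumptions made for illustration.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent identity: identical residues / residues of the shorter sequence x 100."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have equal length")
    # Identities score +1; substitutions, insertions and deletions score 0.
    matches = sum(1 for a, b in zip(aligned_a, aligned_b) if a == b and a != gap)
    # The "shorter" sequence is the one with fewer actual residues in the aligned region.
    residues_a = sum(1 for a in aligned_a if a != gap)
    residues_b = sum(1 for b in aligned_b if b != gap)
    shorter = min(residues_a, residues_b)
    if shorter == 0:
        raise ValueError("aligned region contains no residues")
    return 100.0 * matches / shorter
```

Note that, because the denominator is the shorter sequence, an alignment such as `ACGTACGT` against `ACGT--GT` evaluates to 100 percent under this definition, while two substitutions in an 8-residue alignment give 75 percent.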

3.5.3.8. Domain Homology Clamps: a Portion of the Targeting Polynucleotide that can Specifically Hybridize to a Nucleic Acid Encoding a Domain within a Gene of Interest

These corresponding/complementary sequences are sometimes referred to herein as “domain homology clamps”, as they serve as templates for homologous pairing with the predetermined endogenous sequence(s). Thus, a “domain homology clamp” is a portion of the targeting polynucleotide that can specifically hybridize to a nucleic acid encoding a domain within a gene of interest. “Specific hybridization” is defined herein as the formation of hybrids between a targeting polynucleotide (e.g., a polynucleotide of the invention which may include substitutions, deletions, and/or additions as compared to the predetermined target nucleic acid sequence) and a predetermined target nucleic acid, wherein the targeting polynucleotide preferentially hybridizes to the predetermined target nucleic acid such that, for example, at least one discrete band can be identified on a Southern blot of nucleic acid prepared from target cells that contain the target nucleic acid sequence, and/or a targeting polynucleotide in an intact nucleus localizes to a discrete chromosomal location characteristic of a unique or repetitive sequence. As will be appreciated by those in the art, a target domain sequence may be present in more than one target polynucleotide species (e.g., a particular target sequence may occur in multiple members of a gene family). It is evident that optimal hybridization conditions will vary depending upon the sequence composition and length(s) of the targeting polynucleotide(s) and target(s), and the experimental method selected by the practitioner. Various guidelines may be used to select appropriate hybridization conditions (see Maniatis et al., Molecular Cloning: A Laboratory Manual (1989), 2nd Ed., Cold Spring Harbor, N.Y. and Berger and Kimmel, Methods in Enzymology, Volume 152, Guide to Molecular Cloning Techniques (1987), Academic Press, Inc., San Diego, Calif.), which are incorporated herein by reference. Methods for hybridizing a targeting polynucleotide to a discrete chromosomal location in intact nuclei are known in the art; see, for example, WO 93/05177 and Kowalczykowski and Zarling (1994) in Gene Targeting, Ed. Manuel Vega.

In targeting polynucleotides, domain homology clamps are typically located at or near the 5′ or 3′ end; preferably, domain homology clamps are internal or located at each end of the polynucleotide (Berinstein et al. (1992) Molec. Cell. Biol. 12: 360, which is incorporated herein by reference). Without wishing to be bound by any particular theory, it is believed that the addition of recombinases permits efficient gene targeting with targeting polynucleotides having short (i.e., about 10 to 1000 basepair long) segments of homology, as well as with targeting polynucleotides having longer segments of homology.

3.5.3.9. Targeting Polynucleotides

3.5.3.9.1. Targeting Polynucleotides that have Domain Homology Clamps that are Highly Homologous to the Predetermined Target Endogenous Functional Domain Nucleic Acid Sequence

Therefore, it is preferred that targeting polynucleotides of the invention have domain homology clamps that are highly homologous to the predetermined target endogenous functional domain nucleic acid sequence(s). Typically, targeting polynucleotides of the invention have at least one domain homology clamp that is at least about 18 to 35 nucleotides long, and it is preferable that domain homology clamps are at least about 20 to 100 nucleotides long, and more preferably at least about 100-500 nucleotides long, although the degree of sequence homology between the domain homology clamp and the targeted sequence and the base composition of the targeted sequence will determine the optimal and minimal clamp lengths (e.g., G-C rich sequences are typically more thermodynamically stable and will generally require shorter clamp length). Therefore, both domain homology clamp length and the degree of sequence homology can only be determined with reference to a particular predetermined sequence, but domain homology clamps generally must be at least about 10 nucleotides long and must also substantially correspond or be substantially complementary to a predetermined target sequence. Preferably, a homology clamp is at least about 10, and preferably at least about 50, nucleotides long and is substantially identical to or complementary to a predetermined target sequence. Without wishing to be bound by a particular theory, it is believed that the addition of recombinases to a targeting polynucleotide enhances the efficiency of homologous recombination between homologous, nonisogenic sequences (e.g., between an exon 2 sequence of an albumin gene of a Balb/c mouse and a homologous albumin gene exon 2 sequence of a C57/BL6 mouse), as well as between isogenic sequences.

3.5.3.9.2. Targeting Polynucleotides Comprising a Plurality of Targeting Polynucleotides Comprising at Least one Shared Homology Clamp and a Degenerate Sequence

In one aspect of the invention, the targeting polynucleotides comprise a plurality of targeting polynucleotides comprising at least one shared homology clamp and a degenerate sequence. By “plurality” herein is meant more than one. The targeting polynucleotides find use in the mutagenesis and evolution of a target nucleic acid sequence that encodes a specific protein domain by insertion, deletion and/or substitution of the nucleic acid sequence encoding the domain. In one embodiment the degenerate sequence is completely randomized, representing all possible combinations of nucleotides. In another embodiment, the degenerate sequence is biased, for example, to eliminate sequences encoding transcriptional or translational stop signals. In another embodiment, the degenerate sequence is biased to represent the codon bias of a host cell or class of organisms. The degenerate sequence is optionally biased to randomize specific sequences while maintaining other sequences constant. The length of the degenerate sequence is determined by the practitioner and is based on the desired number of nucleotides within the predetermined sequence to be modified.
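As an illustration of a biased degenerate sequence, the sketch below enumerates a small in silico library in which every member shares the same flanking homology clamps and differs only in a degenerate codon region from which the three standard stop codons (TAA, TAG, TGA) are excluded. The function names, the use of two flanking clamps, and the assumption that the degenerate region lies in the reading frame are all choices made for illustration, not requirements of the method.

```python
from itertools import product
from typing import Iterator, List

# Translational stop codons of the standard genetic code.
STOP_CODONS = {"TAA", "TAG", "TGA"}


def degenerate_codons(exclude_stops: bool = True) -> List[str]:
    """Return all 64 codons, optionally biased to omit the three stop codons."""
    codons = ["".join(bases) for bases in product("ACGT", repeat=3)]
    if exclude_stops:
        codons = [codon for codon in codons if codon not in STOP_CODONS]
    return codons


def build_library(left_clamp: str, right_clamp: str, n_codons: int,
                  exclude_stops: bool = True) -> Iterator[str]:
    """Yield targeting sequences: shared clamps flanking a degenerate codon region."""
    codons = degenerate_codons(exclude_stops)
    for combination in product(codons, repeat=n_codons):
        yield left_clamp + "".join(combination) + right_clamp
```

A fully randomized single codon gives 64 members and the stop-free variant gives 61; the library size grows as 61 to the power of the number of degenerate codons, so enumeration is practical only for short degenerate regions.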

3.5.3.9.3. Targeting Polynucleotides Substantially Identical to the Predetermined Target Sequence

In an alternative embodiment, the targeting polynucleotides are substantially identical to the predetermined target sequence. In the presence of a recombinase, the targeting polynucleotides form complexes with a predetermined target sequence of a target nucleic acid. As a part of the complex, the predetermined target sequence is resistant to nuclease digestion. The regions flanking the polynucleotide:target complex are susceptible to single-strand specific exonucleases. Accordingly, to effect domain specific evolution, these regions are nicked and the resultant fragments are reassembled and recombined by PCR as described below and by Stemmer et al. Nature 370: 389-391 and Stemmer et al. PNAS USA 91: 10747-10751, hereby incorporated by reference.

The formation of heteroduplex joints is not a stringent process; genetic evidence supports the view that the classical phenomena of meiotic gene conversion and aberrant meiotic segregation result in part from the inclusion of mismatched base pairs in heteroduplex joints, and the subsequent correction of some of these mismatched base pairs before replication. Observations on RecA protein have provided information on parameters that affect the discrimination of relatedness from perfect or near-perfect homology and that affect the inclusion of mismatched base pairs in heteroduplex joints. The ability of RecA protein to drive strand exchange past all single base-pair mismatches and to form extensively mismatched joints in superhelical DNA reflects its role in recombination and gene conversion. This error-prone process may also be related to its role in mutagenesis. RecA-mediated pairing reactions involving DNA of φX174 and G4, which are about 70 percent homologous, have yielded homologous recombinants (Cunningham et al. (1981) Cell 24: 213), although RecA preferentially forms homologous joints between highly homologous sequences, and is implicated as mediating a homology search process between an invading DNA strand and a recipient DNA strand, producing relatively stable heteroduplexes at regions of high homology. Accordingly, it is the fact that recombinases can drive the homologous recombination reaction between strands which are significantly, but not perfectly, homologous, which allows gene conversion and the modification of target sequences. Thus, targeting polynucleotides may be used to introduce nucleotide substitutions, insertions and deletions into an endogenous functional domain nucleic acid sequence, and thus the corresponding amino acid substitutions, insertions and deletions in proteins expressed from the endogenous functional domain nucleic acid sequence. By “endogenous” in this context herein is meant the naturally occurring sequence, i.e. sequences or substances originating from within a cell or organism. Similarly, “exogenous” refers to sequences or substances originating outside the cell or organism.

3.5.3.9.4. Method Where Two Substantially Complementary Targeting Polynucleotides are Used

In a preferred embodiment, two substantially complementary targeting polynucleotides are used.

3.5.3.9.5. Method where the Targeting Polynucleotides form a Double Stranded Hybrid, which may be Coated with Recombinase

In one embodiment, the targeting polynucleotides form a double stranded hybrid, which may be coated with recombinase, although when the recombinase is RecA, the loading conditions may be somewhat different from those used for single stranded nucleic acids.

3.5.3.9.6. Method where Two Substantially Complementary Single-Stranded Targeting Polynucleotides are Used

In a preferred embodiment, two substantially complementary single-stranded targeting polynucleotides are used. The two complementary single-stranded targeting polynucleotides are usually of equal length, although this is not required. However, as noted below, the stability of the four strand hybrids of the invention is putatively related, in part, to the lack of significant unhybridized single-stranded nucleic acid, and thus significant unpaired sequences are not preferred. Furthermore, as noted above, the complementarity between the two targeting polynucleotides need not be perfect. The two complementary single-stranded targeting polynucleotides are simultaneously or contemporaneously introduced into a target cell harboring a predetermined endogenous target sequence, generally with at least one recombinase protein (e.g., RecA). Under most circumstances, it is preferred that the targeting polynucleotides are incubated with RecA or other recombinase prior to introduction into a target cell, so that the recombinase protein(s) may be “loaded” onto the targeting polynucleotide(s), to coat the nucleic acid, as is described below. Incubation conditions for such recombinase loading are described infra. A targeting polynucleotide may contain a sequence that enhances the loading process of a recombinase; for example, a RecA loading sequence is the recombinogenic nucleation sequence poly[d(A-C)], and its complement, poly[d(G-T)]. The duplex sequence poly[d(A-C)·d(G-T)]n, where n is from 5 to 25, is a middle repetitive element in target DNA.
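For illustration, candidate RecA loading sequences of the poly[d(A-C)] or poly[d(G-T)] type, with 5 to 25 repeats, can be located in a single strand with a regular expression. The sketch below is a hypothetical helper, not part of the described method; it scans only the strand supplied, and a run longer than 25 repeats is reported as a 25-repeat match followed by any remainder that itself reaches 5 repeats.

```python
import re
from typing import List, Tuple

# (AC)n or (GT)n with n from 5 to 25, read on the strand supplied.
_LOADING_MOTIF = re.compile(r"(?:AC){5,25}|(?:GT){5,25}")


def find_loading_sequences(seq: str) -> List[Tuple[int, int, str]]:
    """Return (start, end, matched text) for each candidate nucleation sequence."""
    return [
        (match.start(), match.end(), match.group())
        for match in _LOADING_MOTIF.finditer(seq.upper())
    ]
```

Positions are 0-based with an exclusive end, following the usual Python slice convention, so `seq[start:end]` recovers the matched repeat.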

There appears to be a fundamental difference in the stability of RecA-protein-mediated D-loops formed between one single-stranded DNA (ssDNA) probe hybridized to negatively supercoiled DNA targets in comparison to relaxed or linear duplex DNA targets. Internally located dsDNA target sequences on relaxed linear DNA targets hybridized by ssDNA probes produce single D-loops, which are unstable after removal of RecA protein (Adzuma, Genes Devel. 6: 1679 (1992); Hsieh et al., PNAS USA 89: 6492 (1992); Chiu et al., Biochemistry 32: 13146 (1993)). This probe DNA instability of hybrids formed with linear duplex DNA targets is most probably due to the incoming ssDNA probe W-C base pairing with the complementary DNA strand of the duplex target and disrupting the base pairing in the other DNA strand. The required high free-energy of maintaining a disrupted DNA strand in an unpaired ssDNA conformation in a protein-free single-D-loop apparently can only be compensated for either by the stored free energy inherent in negatively supercoiled DNA targets or by base pairing initiated at the distal ends of the joint DNA molecule, allowing the exchanged strands to freely intertwine. However, the addition of a second complementary ssDNA to the three-strand-containing single-D-loop stabilizes the deproteinized hybrid joint molecules by allowing W-C base pairing of the probe with the displaced target DNA strand. The addition of a second RecA-coated complementary ssDNA (cssDNA) strand to the three-strand containing single D-loop stabilizes deproteinized hybrid joints located away from the free ends of the duplex target DNA (Sena & Zarling, Nature Genetics 3: 365 (1993); Revet et al., J. Mol. Biol. 232: 779 (1993); Jayasena and Johnston, J. Mol. Bio. 230: 1015 (1993)). The resulting four-stranded structure, named a double D-loop by analogy with the three-stranded single D-loop hybrid, has been shown to be stable in the absence of RecA protein. This stability likely occurs because the restoration of W-C basepairing in the parental duplex would require disruption of two W-C basepairs in the double-D-loop (one W-C pair in each heteroduplex D-loop).

Since each base-pairing in the reverse transition (double-D-loop to duplex) is less favorable by the energy of one W-C basepair, the pair of cssDNA probes is thus kinetically trapped in duplex DNA targets in stable hybrid structures. The stable double-D-loop joint molecule within internally located probe:target hybrids is an intermediate stage prior to the progression of the homologous recombination reaction to the strand exchange phase. The double D-loop permits isolation of stable multistranded DNA recombination intermediates.

In addition, when the targeting polynucleotides are used to generate insertions or deletions in an endogenous nucleic acid sequence, as is described herein, the use of two complementary single-stranded targeting polynucleotides allows the use of internal homology clamps. The use of internal homology clamps allows the formation of stable deproteinized cssDNA probe:target hybrids with homologous DNA sequences containing either relatively small or large insertions and deletions within a homologous DNA target. Without being bound by theory, it appears that these probe:target hybrids, with heterologous inserts in the cssDNA probe, are stabilized by the re-annealing of cssDNA probes to each other within the double-D-loop hybrid, forming a novel DNA structure with an internal homology clamp. Similarly, double-D-loop hybrids formed at internal sites with heterologous inserts in the linear DNA targets (with respect to the cssDNA probe) are equally stable. Because cssDNA probes are kinetically trapped within the duplex target, the multi-stranded DNA intermediates of homologous DNA pairing are stabilized and strand exchange is facilitated.

3.5.3.10. Length of the Internal Homology Clamp (i.e. the Length of the Insertion or Deletion)

In a preferred embodiment, the length of the internal homology clamp (i.e. the length of the insertion or deletion) is from about 1 to 50% of the total length of the targeting polynucleotide, with from about 1 to about 20% being preferred and from about 1 to about 10% being especially preferred, although in some cases the length of the deletion or insertion may be significantly larger. As for the domain homology clamps, the complementarity within the internal homology clamp need not be perfect. A targeting polynucleotide used in a method of the invention typically is a single-stranded nucleic acid, usually a DNA strand, or derived by denaturation of a duplex DNA, which is complementary to one (or both) strand(s) of the target duplex nucleic acid. Thus, one of the complementary single stranded targeting polynucleotides is complementary to one strand of the endogenous target sequence (i.e. Watson) and the other complementary single stranded targeting polynucleotide is complementary to the other strand of the endogenous target sequence (i.e. Crick). The domain homology clamp sequence preferably contains at least 90-95% sequence homology with the target sequence (although as outlined above, less sequence homology can be tolerated), to ensure sequence-specific targeting of the targeting polynucleotide to the endogenous DNA domain target. Each single-stranded targeting polynucleotide is typically about 50-600 bases long, although a shorter or longer polynucleotide may also be employed.
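The preferred ranges for the internal homology clamp can be expressed as a simple check on the ratio of insertion or deletion length to total targeting polynucleotide length. The sketch below is illustrative only; the function name and the tier labels are hypothetical, and the “about” in each range is treated as a hard boundary.

```python
def internal_clamp_tier(indel_length_nt: int, total_length_nt: int) -> str:
    """Classify an internal homology clamp by its share of the targeting polynucleotide."""
    if total_length_nt <= 0 or indel_length_nt < 0:
        raise ValueError("lengths must be positive (total) and non-negative (indel)")
    percent = 100.0 * indel_length_nt / total_length_nt
    if 1 <= percent <= 10:
        return "especially preferred"    # about 1 to about 10%
    if 1 <= percent <= 20:
        return "preferred"               # about 1 to about 20%
    if 1 <= percent <= 50:
        return "within stated range"     # about 1 to 50%
    return "outside stated range"        # tolerated only "in some cases" per the text
```

For a 100-nucleotide targeting polynucleotide, a 5-nucleotide insertion falls in the especially preferred tier and a 40-nucleotide insertion remains within the stated 1 to 50% range.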

3.5.3.11. Method for Making the Targeting Polynucleotides

Once the gene family and domain sequence is selected, the targetingpolynucleotides are made, as will be appreciated by those in the art.For example, for large targeting polynucleotides, plasmids areengineered to contain an appropriately sized gene sequence with adeletion or insertion in the gene of interest and at least one flankinghomology clamp which substantially corresponds or is substantiallycomplementary to an endogenous target DNA sequence. Vectors containing atargeting polynucleotide sequence are typically grown in E coli and thenisolated using standard molecular biology methods. Alternatively,targeting polynucleotides may be prepared in single-stranded form byoligonucleotide synthesis methods, which may first require, especiallywith larger targeting polynucleotides, formation of subfragments of thetargeting polynucleotide, typically followed by splicing of thesubfragments together, typically by enzymatic ligation. In general, aswill be appreciated by those in the art, targeting polynucleotides maybe produced by chemical synthesis of oligonucleotides, nick-translationof a double-stranded DNA template, polymerase chain-reactionamplification of a sequence (or ligase chain reaction amplification),purification of prokaryotic or target cloning vectors harboring asequence of interest (e.g., a cloned cDNA or genomic clone, or portionthereof) such as plasmids, phagemids, YACs, cosmids, bacteriophage DNA,other viral DNA or replication intermediates, or purified restrictionfragments thereof, as well as other sources of single anddouble-stranded polynucleotides having a desired nucleotide sequence.When using microinjection procedures it may be preferable to use atransfection technique with linearized sequences containing onlymodified target gene sequence and without vector or selectablesequences. 
The modified gene site is such that a homologous recombinant between the exogenous targeting polynucleotide and the endogenous DNA target sequence can be identified by using carefully chosen primers and PCR, followed by analysis to detect if PCR products specific to the desired targeted event are present (Erlich et al., (1991) Science 252: 1643, which is incorporated herein by reference). Several studies have already used PCR to successfully identify and then clone the desired transfected cell lines (Zimmer and Gruss, (1989) Nature 338: 150; Mouellic et al., (1990) Proc. Natl. Acad. Sci. USA 87: 4712; Shesely et al., (1991) Proc. Natl. Acad. Sci. USA 88: 4294, which are incorporated herein by reference). This approach is very effective when the number of cells receiving exogenous targeting polynucleotide(s) is high (i.e., with microinjection, or with liposomes) and the treated cell populations are allowed to expand to cell groups of approximately 1×10⁴ cells (Capecchi, (1989) Science 244: 1288). When the target gene is not on a sex chromosome, or the cells are derived from a female, both alleles of a gene can be targeted by sequential inactivation (Mortensen et al., (1991) Proc. Natl. Acad. Sci. USA 88: 7036). Alternatively, animals heterozygous for the target gene can be bred to homozygosity, as is known in the art.

The invention may also be practiced with individual targeting polynucleotides which do not comprise part of a complementary pair. In each case, a targeting polynucleotide is introduced into a target cell simultaneously or contemporaneously with a recombinase protein, typically in the form of a recombinase-coated targeting polynucleotide as outlined herein (i.e., a polynucleotide pre-incubated with recombinase wherein the recombinase is noncovalently bound to the polynucleotide; generally referred to in the art as a nucleoprotein filament).

3.5.3.12. Alterations in the Target Nucleic Acid Comprising a Domain or Domains of Interest

The present invention allows for the introduction of alterations in the target nucleic acid comprising a domain or domains of interest. That is, the fact that heterologies are tolerated in targeting polynucleotides allows for two things: first, the use of heterologous domain homology clamps that may target genes encoding functional domains of a protein or multiple proteins, resulting in a variety of genotypes and phenotypes, and second, the introduction of alterations to the target sequence. Thus typically, a targeting polynucleotide (or complementary polynucleotide pair) has a portion or region having a sequence that is not present in the preselected endogenous targeted sequence(s) (i.e., a nonhomologous portion or mismatch) which may be as small as a single mismatched nucleotide, several mismatches, or may span up to about several kilobases or more of nonhomologous sequence.

3.5.3.12.1. Methods and Compositions for Inactivation of a Domain of a Gene

Accordingly, in a preferred embodiment, the methods and compositions of the invention are used for inactivation of a domain of a gene. That is, exogenous targeting polynucleotides can be used to inactivate, decrease or alter the biological activity of one or more domains in a gene of a cell (or transgenic nonhuman animal or plant). This finds particular use in the generation of animal models of disease states, or in the elucidation of gene function and activity, similar to “knock out” experiments. Alternatively, the biological activity of the wild-type gene may be either decreased, or the wild-type activity altered to mimic disease states. This includes genetic manipulation of non-coding gene sequences that affect the transcription of genes, including promoters, repressors, enhancers and transcriptional activating sequences.

3.5.3.12.1.1. Amino Acid Substitutions, Insertions or Deletions in the Endogenous Target Sequences

Thus in a preferred embodiment, homologous recombination of the targeting polynucleotide and endogenous target sequence will result in amino acid substitutions, insertions or deletions in the endogenous target sequences, potentially both within the functional domain region and outside of it, for example as a result of the incorporation of PCR tags. This will generally result in modulated or altered gene function of the endogenous gene, including both a decrease or elimination of function as well as an enhancement of function. Nonhomologous portions are used to make insertions, deletions, and/or replacements in a predetermined endogenous targeted DNA sequence, and/or to make single or multiple nucleotide substitutions in a predetermined endogenous target DNA sequence so that the resultant recombined sequence (i.e., a targeted recombinant endogenous sequence) incorporates some or all of the sequence information of the nonhomologous portion of the targeting polynucleotide(s). Thus, the nonhomologous regions are used to make variant sequences, i.e., targeted sequence modifications. In this way, site-directed modifications may be done in a variety of systems for a variety of purposes.

3.5.3.12.1.1.1. Disruption by Either the Substitution, Insertion, Deletion or Frame Shifting of Nucleotides

The endogenous target sequence, generally nucleic acid encoding a domain, may be disrupted in a variety of ways. The term “disrupt” as used herein comprises a change in the coding or non-coding sequence of an endogenous nucleic acid. In one preferred embodiment, a disrupted gene will no longer produce a functional gene product. In another preferred embodiment, a disrupted gene produces a variant gene product. Generally, disruption may occur by either the substitution, insertion, deletion or frame shifting of nucleotides.

3.5.3.12.1.1.2. Disruption by Amino Acid Substitutions

In one embodiment, amino acid substitutions are made. This can be the result of either the incorporation of a non-naturally occurring domain sequence into a target, or of more specific changes to a particular sequence outside of the domain sequence.

3.5.3.12.1.1.3. Disruption by an Insertion Sequence

In one embodiment, the endogenous sequence is disrupted by an insertion sequence. The term “insertion sequence” as used herein means one or more nucleotides which are inserted into an endogenous gene to disrupt it. In general, insertion sequences can be as short as 1 nucleotide or as long as a gene, as outlined herein. For non-gene insertion sequences, the sequences are at least 1 nucleotide, with from about 1 to about 50 nucleotides being preferred, and from about 10 to 25 nucleotides being particularly preferred. An insertion sequence may comprise a polylinker sequence, with from about 1 to about 50 nucleotides being preferred, and from about 10 to 25 nucleotides being particularly preferred. An insertion sequence may be a PCR tag used for identification of the first gene. In a preferred embodiment, an insertion sequence comprises a gene which not only disrupts the endogenous gene, thus preventing its expression, but also can result in the expression of a new gene product. Thus, in a preferred embodiment, the disruption of an endogenous gene by an insertion sequence gene is done in such a manner as to allow the transcription and translation of the insertion gene. An insertion sequence that encodes a gene may range from about 50 bp to 5000 bp of cDNA or about 5000 bp to 50000 bp of genomic DNA. As will be appreciated by those in the art, this can be done in a variety of ways. In a preferred embodiment, the insertion gene is targeted to the endogenous gene in such a manner as to utilize endogenous regulatory sequences, including promoters, enhancers or other regulatory sequences. In an alternate embodiment, the insertion sequence gene includes its own regulatory sequences, such as a promoter, enhancer or other regulatory sequence.

Particularly preferred insertion sequence genes include, but are not limited to, genes which encode selection or reporter proteins. In addition, the insertion sequence genes may be modified or variant genes.

3.5.3.12.1.1.4. Disruption by Deletions

The term “deletion” as used herein comprises removal of a portion of the nucleic acid sequence of an endogenous gene. Deletions range from about 1 to about 100 nucleotides, with from about 1 to 50 nucleotides being preferred and from about 1 to about 25 nucleotides being particularly preferred, although in some cases deletions may be much larger, and may effectively comprise the removal of the entire functional domain, the entire endogenous gene and/or its regulatory sequences. Deletions may occur in combination with substitutions or modifications to arrive at a final modified endogenous gene.

3.5.3.12.1.1.5. Disruption Simultaneously by an Insertion and a Deletion

In a preferred embodiment, endogenous genes may be disrupted simultaneously by an insertion and a deletion. For example, a domain of an endogenous gene, with or without its regulatory sequences, may be removed and replaced with an insertion sequence gene. Thus, for example, all but the regulatory sequences of an endogenous gene may be removed, and replaced with an insertion sequence gene, which is now under the control of the endogenous gene's regulatory elements.

The term “regulatory element” is used herein to describe a non-coding sequence which affects the transcription or translation of a gene, including, but not limited to, promoter sequences, ribosomal binding sites, transcriptional start and stop sequences, translational start and stop sequences, enhancer or activator sequences, dimerizing sequences, etc. In a preferred embodiment, the regulatory sequences include a promoter and transcriptional start and stop sequences. Promoter sequences encode either constitutive or inducible promoters. The promoters may be either naturally occurring promoters or hybrid promoters. Hybrid promoters, which combine elements of more than one promoter, are also known in the art, and are useful in the present invention.

In addition to domain homology clamps and optional internal homology clamps, the targeting polynucleotides of the invention may comprise additional components, such as cell-uptake components, chemical substituents, purification tags, etc.

3.5.3.12.2. Targeting Polynucleotide Comprising a Cell-Uptake Component

In a preferred embodiment, at least one of the targeting polynucleotides comprises at least one cell-uptake component. As used herein, the term “cell-uptake component” refers to an agent which, when bound, either directly or indirectly, to a targeting polynucleotide, enhances the intracellular uptake of the targeting polynucleotide into at least one cell type (e.g., hepatocytes). A targeting polynucleotide of the invention may optionally be conjugated, typically by covalent or preferably noncovalent binding, to a cell-uptake component. Various methods have been described in the art for targeting DNA to specific cell types. A targeting polynucleotide of the invention can be conjugated to essentially any of several cell-uptake components known in the art. For targeting to hepatocytes, a targeting polynucleotide can be conjugated to an asialoorosomucoid (ASOR)-poly-L-lysine conjugate by methods described in the art and incorporated herein by reference (Wu G Y and Wu C H (1987) J. Biol. Chem. 262: 4429; Wu G Y and Wu C H (1988) Biochemistry 27: 887; Wu G Y and Wu C H (1988) J. Biol. Chem. 263: 14621; Wu G Y and Wu C H (1992) J. Biol. Chem. 267: 12436; Wu et al. (1991) J. Biol. Chem. 266: 14338; and Wilson et al. (1992) J. Biol. Chem. 267: 963, WO92/06180; WO92/05250; and WO91/17761, which are incorporated herein by reference).

Alternatively, a cell-uptake component may be formed by incubating the targeting polynucleotide with at least one lipid species and at least one protein species to form protein-lipid-polynucleotide complexes consisting essentially of the targeting polynucleotide and the lipid-protein cell-uptake component. Lipid vesicles made according to Felgner (WO91/17424, incorporated herein by reference) and/or cationic lipidization (WO91/16024, incorporated herein by reference) or other forms for polynucleotide administration (EP 465,529, incorporated herein by reference) may also be employed as cell-uptake components. Nucleases, DNA damaging chemicals, UV radiation or gamma-radiation may also be used.

In addition to cell-uptake components, targeting components such as nuclear localization signals may be used, as is known in the art. See for example Kido et al., Exper. Cell Res. 198: 107-114 (1992), hereby expressly incorporated by reference. Typically, a targeting polynucleotide of the invention is coated with at least one recombinase and is conjugated to a cell-uptake component, and the resulting cell targeting complex is contacted with a target cell under uptake conditions (e.g., physiological conditions) so that the targeting polynucleotide and the recombinase(s) are internalized in the target cell. A targeting polynucleotide may be contacted simultaneously or sequentially with a cell-uptake component and also with a recombinase; preferably the targeting polynucleotide is contacted first with a recombinase, or with a mixture comprising both a cell-uptake component and a recombinase under conditions whereby, on average, at least about one molecule of recombinase is noncovalently attached per targeting polynucleotide molecule and at least about one cell-uptake component also is noncovalently attached. Most preferably, coating of both recombinase and cell-uptake component saturates essentially all of the available binding sites on the targeting polynucleotide. A targeting polynucleotide may be preferentially coated with a cell-uptake component so that the resultant targeting complex comprises, on a molar basis, more cell-uptake component than recombinase(s). Alternatively, a targeting polynucleotide may be preferentially coated with recombinase(s) so that the resultant targeting complex comprises, on a molar basis, more recombinase(s) than cell-uptake component.

Cell-uptake components are included with recombinase-coated targeting polynucleotides of the invention to enhance the uptake of the recombinase-coated targeting polynucleotide(s) into cells, particularly for in vivo gene targeting applications, such as gene therapy to treat genetic diseases, including neoplasia, and targeted homologous recombination to treat viral infections wherein a viral sequence (e.g., an integrated hepatitis B virus (HBV) genome or genome fragment) may be targeted by homologous sequence targeting and inactivated. Alternatively, a targeting polynucleotide may be coated with the cell-uptake component and targeted to cells with a contemporaneous or simultaneous administration of a recombinase (e.g., liposomes or immunoliposomes containing a recombinase, a viral-based vector encoding and expressing a recombinase).

In addition to recombinase and cellular uptake components, at least one of the targeting polynucleotides may include chemical substituents. Exogenous targeting polynucleotides that have been modified with appended chemical substituents may be introduced along with recombinase (e.g., RecA) into a metabolically active target cell to homologously pair with a predetermined endogenous DNA target sequence in the cell. In a preferred embodiment, the exogenous targeting polynucleotides are derivatized, and additional chemical substituents are attached, either during or after polynucleotide synthesis, respectively, and are thus localized to a specific endogenous target sequence where they produce an alteration or chemical modification to a local DNA sequence. Preferred attached chemical substituents include, but are not limited to: cross-linking agents (see Podyminogin et al., Biochem. 34: 13098 (1995) and 35: 7267 (1996), both of which are hereby incorporated by reference), nucleic acid cleavage agents, metal chelates (e.g., iron/EDTA chelate for iron catalyzed cleavage), topoisomerases, endonucleases, exonucleases, ligases, phosphodiesterases, photodynamic porphyrins, chemotherapeutic drugs (e.g., adriamycin, doxorubicin), intercalating agents, labels, base-modification agents, agents which normally bind to nucleic acids such as labels, etc. (see for example Afonina et al., PNAS USA 93: 3199 (1996), incorporated herein by reference), immunoglobulin chains, and oligonucleotides. Iron/EDTA chelates are particularly preferred chemical substituents where local cleavage of a DNA sequence is desired (Hertzberg et al. (1982) J. Am. Chem. Soc. 104: 313; Hertzberg and Dervan (1984) Biochemistry 23: 3934; Taylor et al. (1984) Tetrahedron 40: 457; Dervan, P B (1986) Science 232: 464, which are incorporated herein by reference).
Further preferred are groups that prevent hybridization of the complementary single-stranded nucleic acids to each other but not to unmodified nucleic acids; see for example Kutyavin et al., Biochem. 35: 11170 (1996) and Woo et al., Nucleic Acid. Res. 24(13): 2470 (1996), both of which are incorporated by reference. 2′-O-methyl groups are also preferred (see Cole-Strauss et al., Science 273: 1386 (1996); Yoon et al., PNAS 93: 2071 (1996)). Additional preferred chemical substituents include labeling moieties, including fluorescent labels. Preferred attachment chemistries include: direct linkage, e.g., via an appended reactive amino group (Corey and Schultz (1988) Science 238: 1401, which is incorporated herein by reference) and other direct linkage chemistries, although streptavidin/biotin and digoxigenin/antidigoxigenin antibody linkage methods may also be used. Methods for linking chemical substituents are provided in U.S. Pat. Nos. 5,135,720, 5,093,245, and 5,055,556, which are incorporated herein by reference. Other linkage chemistries may be used at the discretion of the practitioner.

3.5.3.12.3. Targeting Polynucleotides Comprising at Least One Purification Tag or Capture Moiety

In a preferred embodiment, at least one of the targeting polynucleotides comprises at least one purification tag or capture moiety, some of which are discussed above as chemical substituents, for example biotin, digoxigenin, psoralen, etc. Alternatively, the domain oligonucleotide could be directly attached to beads with the targeting reaction performed on a solid phase support.

3.5.3.12.4. Targeting Polynucleotides are Coated with Recombinase Prior to Introduction to the Domain Target

In a preferred embodiment, the targeting polynucleotides are coated with recombinase prior to introduction to the domain target. The procedures below are directed to the use of E. coli RecA, although as will be appreciated by those in the art, other recombinases may be used as well. Targeting polynucleotides can be coated using GTPgammaS, mixes of ATPgammaS with rATP, rGTP and/or dATP, or dATP or rATP alone in the presence of an rATP generating system (Boehringer Mannheim). Various mixtures of GTPgammaS, ATPgammaS, ATP, ADP, dATP and/or rATP or other nucleosides may be used; particularly preferred are mixes of ATPgammaS and ATP or ATPgammaS and ADP.

The targeting polynucleotide, whether double-stranded or single-stranded, is denatured by heating in an aqueous solution at 95-100° C. for five minutes, then placed in an ice bath for 20 seconds to about one minute followed by centrifugation at 0° C. for approximately 20 sec, before use. When denatured targeting polynucleotides are not placed in a freezer at −20° C. they are usually immediately added to standard RecA coating reaction buffer containing ATPgammaS, at room temperature, and to this is added the RecA protein.

Alternatively, RecA protein may be included with the buffer components and ATPgammaS before the polynucleotides are added.

RecA coating of targeting polynucleotide(s) is initiated by incubating polynucleotide-RecA mixtures at 37° C. for 10-15 min. RecA protein concentration tested during reaction with polynucleotide varies depending upon polynucleotide size and the amount of added polynucleotide, and the ratio of RecA molecules to nucleotides preferably ranges between about 3:1 and 1:3. When single-stranded polynucleotides are RecA coated independently of their homologous polynucleotide strands, the mM and μM concentrations of ATPgammaS and RecA, respectively, can be reduced to one-half those used with double-stranded targeting polynucleotides (i.e., RecA and ATPgammaS concentration ratios are usually kept constant at a specific concentration of individual polynucleotide strand, depending on whether a single- or double-stranded polynucleotide is used).

RecA protein coating of targeting polynucleotides is normally carried out in a standard 10× RecA coating reaction buffer. 10× RecA reaction buffer (i.e., 10×AC buffer) consists of: 100 mM Tris acetate (pH 7.5 at 37° C.), 20 mM magnesium acetate, 500 mM sodium acetate, 10 mM DTT, and 50% glycerol. All of the targeting polynucleotides, whether double-stranded or single-stranded, typically are denatured before use by heating to 95-100° C. for five minutes, placed on ice for one minute, and subjected to centrifugation (10,000 rpm) at 0° C. for approximately 20 seconds (e.g., in a Tomy centrifuge). Denatured targeting polynucleotides usually are added immediately to room temperature RecA coating reaction buffer mixed with ATPgammaS and diluted with double-distilled H₂O as necessary.
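The dilution of the 10× buffer stock to working strength is simple arithmetic, and a minimal sketch may make it concrete. The component values below are those listed for the 10× buffer; the function names and the 50 μl reaction volume are hypothetical and chosen only for illustration.

```python
# Composition of the 10x RecA coating buffer as listed in the text.
# Values are in mM, except glycerol, which is in percent.
TEN_X_BUFFER = {
    "Tris acetate (mM)": 100.0,
    "magnesium acetate (mM)": 20.0,
    "sodium acetate (mM)": 500.0,
    "DTT (mM)": 10.0,
    "glycerol (%)": 50.0,
}


def stock_volume_for_1x(final_volume_ul, stock_strength=10.0):
    """Volume of stock (ul) needed to reach 1x in the given final volume."""
    return final_volume_ul / stock_strength


def dilute(stock, stock_volume_ul, final_volume_ul):
    """Return component concentrations after diluting the stock."""
    if not 0 < stock_volume_ul <= final_volume_ul:
        raise ValueError("stock volume must be positive and not exceed the final volume")
    factor = stock_volume_ul / final_volume_ul
    return {name: conc * factor for name, conc in stock.items()}


if __name__ == "__main__":
    final_ul = 50.0  # hypothetical reaction volume, for illustration only
    stock_ul = stock_volume_for_1x(final_ul)
    print(f"add {stock_ul:g} ul of 10x buffer to a {final_ul:g} ul reaction")
    for name, conc in dilute(TEN_X_BUFFER, stock_ul, final_ul).items():
        print(f"  {name}: {conc:g}")
```

The remaining volume is made up by the polynucleotide, ATPgammaS, RecA protein, and double-distilled water.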

A reaction mixture typically contains the following components: (i) 0.2-4.8 mM ATPgammaS; and (ii) between 1-100 ng/μl of targeting polynucleotide. To this mixture, about 1-20 μl of RecA protein per 10-100 μl of reaction mixture, usually at about 2-10 mg/ml (purchased from Pharmacia or purified), is rapidly added and mixed. The final reaction volume for RecA coating of targeting polynucleotide is usually in the range of about 10-500 μl. RecA coating of targeting polynucleotide is usually initiated by incubating targeting polynucleotide-RecA mixtures at 37° C. for about 10-15 min. RecA protein concentrations in coating reactions vary depending upon targeting polynucleotide size and the amount of added targeting polynucleotide: RecA protein concentrations are typically in the range of 5 to 50 μM. When single-stranded targeting polynucleotides are coated with RecA, independently of their complementary strands, the concentrations of ATPgammaS and RecA protein may optionally be reduced to about one-half of the concentrations used with double-stranded targeting polynucleotides of the same length: that is, the RecA protein and ATPgammaS concentration ratios are generally kept constant for a given concentration of individual polynucleotide strands.
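As a rough illustration of how the quoted concentration ranges relate to the preferred RecA-to-nucleotide ratio of about 3:1 to 1:3, the following sketch converts a polynucleotide mass concentration into a nucleotide molarity and compares it with a RecA concentration. The average nucleotide mass of 330 g/mol is an assumption (a common approximation for single-stranded DNA) and does not come from the text; the function names are hypothetical.

```python
# Assumption: average mass of one nucleotide in single-stranded DNA.
AVG_NT_MASS_G_PER_MOL = 330.0


def nucleotide_conc_um(dna_ng_per_ul, nt_mass=AVG_NT_MASS_G_PER_MOL):
    """Convert a DNA mass concentration (ng/ul) to nucleotide molarity (uM).

    ng/ul equals mg/L; mg/L divided by g/mol gives mmol/L; x1000 gives umol/L.
    """
    return dna_ng_per_ul / nt_mass * 1000.0


def reca_to_nucleotide_ratio(reca_um, dna_ng_per_ul):
    """Molar ratio of RecA monomers to nucleotides in the coating reaction."""
    return reca_um / nucleotide_conc_um(dna_ng_per_ul)


def within_preferred_range(ratio, low=1.0 / 3.0, high=3.0):
    """True if the ratio lies between about 1:3 and 3:1."""
    return low <= ratio <= high


if __name__ == "__main__":
    # Hypothetical reaction: 33 ng/ul polynucleotide and 50 uM RecA.
    ratio = reca_to_nucleotide_ratio(reca_um=50.0, dna_ng_per_ul=33.0)
    print(f"RecA:nucleotide = {ratio:.2f}:1, preferred range: {within_preferred_range(ratio)}")
```

Because the ratio is expressed per nucleotide, it is independent of polynucleotide length at a fixed mass concentration; the length matters only when counting RecA monomers per molecule.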

3.5.3.12.4.1. Evaluation of Coating of Targeting Polynucleotides with RecA Protein

The coating of targeting polynucleotides with RecA protein can be evaluated in a number of ways. First, protein binding to DNA can be examined using band-shift gel assays (McEntee et al., (1981) J. Biol. Chem. 256: 8835). Labeled polynucleotides can be coated with RecA protein in the presence of ATPgammaS and the products of the coating reactions may be separated by agarose gel electrophoresis.

Following incubation of RecA protein with denatured duplex DNAs, the RecA protein effectively coats single-stranded targeting polynucleotides derived from denaturing a duplex DNA. As the ratio of RecA protein monomers to nucleotides in the targeting polynucleotide increases from 0, 1:27, 1:2.7 to 3.7:1 for a 121-mer and 0, 1:22, 1:2.2 to 4.5:1 for a 159-mer, the targeting polynucleotide's electrophoretic mobility decreases, i.e., is retarded, due to RecA binding to the targeting polynucleotide. Retardation of the coated polynucleotide's mobility reflects the saturation of targeting polynucleotide with RecA protein. An excess of RecA monomers to DNA nucleotides is required for efficient RecA coating of short targeting polynucleotides (Leahy et al., (1986) J. Biol. Chem. 261: 6954).

A second method for evaluating protein binding to DNA is the use of nitrocellulose filter binding assays (Leahy et al., (1986) J. Biol. Chem. 261: 6954; Woodbury et al., (1983) Biochemistry 22(20): 4730-4737). The nitrocellulose filter binding method is particularly useful in determining the dissociation rates for protein:DNA complexes using labeled DNA. In the filter binding assay, DNA:protein complexes are retained on a filter while free DNA passes through the filter. This assay method is more quantitative for dissociation-rate determinations because the separation of DNA:protein complexes from free targeting polynucleotide is very rapid.

Alternatively, recombinase protein(s) (prokaryotic, eukaryotic or endogenous to the target cell) may be exogenously induced or administered to a target cell simultaneously or contemporaneously (i.e., within about a few hours) with the targeting polynucleotide(s). Such administration is typically done by micro-injection, although electroporation, lipofection, and other transfection methods known in the art may also be used. Alternatively, recombinase proteins may be produced in vivo. For example, they may be produced from a homologous or heterologous expression cassette in a transfected cell or targeted cell, such as a transgenic totipotent cell (e.g., a fertilized zygote) or an embryonal stem cell (e.g., a murine ES cell such as AB-1) used to generate a transgenic non-human animal line or a somatic cell or a pluripotent hematopoietic stem cell for reconstituting all or part of a particular stem cell population (e.g., hematopoietic) of an individual. Conveniently, a heterologous expression cassette includes a modulatable promoter, such as an ecdysone-inducible promoter-enhancer combination, an estrogen-induced promoter-enhancer combination, a CMV promoter-enhancer, an insulin gene promoter, or other cell-type specific, developmental stage-specific, hormone-inducible, drug-inducible, or other modulatable promoter construct so that expression of at least one species of recombinase protein from the cassette can be modulated for transiently producing recombinase(s) in vivo simultaneous or contemporaneous with introduction of a targeting polynucleotide into the cell. When a hormone-inducible promoter-enhancer combination is used, the cell must have the required hormone receptor present, either naturally or as a consequence of expression of a co-transfected expression vector encoding such receptor. Alternatively, the recombinase may be endogenous and produced in high levels.
In this embodiment, preferably in eukaryotic target cells such as tumor cells, the target cells produce an elevated level of recombinase. In other embodiments the level of recombinase may be induced by DNA damaging agents, such as mitomycin C, UV or gamma-irradiation. Alternatively, recombinase levels may be elevated by transfection of a plasmid encoding the recombinase gene into the cell.

3.5.3.13. Specialized Applications

3.5.3.13.1. Identification of New Members of Gene Families Which may be Useful in Functional Genomic Studies as well as in the Identification of New Drug Targets

Once made, the compositions of the invention find use in a number of applications upon administration to target cells. In general, the compositions and methods of the invention are useful to identify new members of gene families which may be useful in functional genomic studies as well as in the identification of new drug targets; both of these may be accomplished through the generation of “knock out” animal models. In addition, the present invention allows the modification of functional domain targets, the creation of transgenic plants and animals, the cloning of genes containing functional domains, etc.

3.5.3.13.2. Domain Specific Gene Evolution

Once made and administered to a target host cell, the compositions of the invention find use in a number of applications, including domain specific gene evolution. The targeted nucleic acid encoding the polypeptide or protein undergoes homologous recombination with the plurality of polynucleotides to produce a plurality of modified target nucleic acids that are expressed to produce a plurality of modified proteins. Selection systems are employed to identify and isolate host cells expressing proteins having a desired property or phenotype. For example, if the expressed protein is an enzyme, cells having a modified enzyme activity are identified. The desired activity can be an increased or decreased or altered activity. Proteins having the desired phenotype are selected and isolated, the modified nucleic acid is sequenced to identify sequences effecting the desired activity, and the process is repeated iteratively as needed to produce a protein having a desired activity or property. In this and other embodiments, suitable target sequences include nucleic acid sequences encoding therapeutically or commercially relevant proteins, including, but not limited to, enzymes (proteases, recombinases, lipases, kinases, carbohydrases, isomerases, tautomerases, nucleases etc.), hormones, receptors, transcription factors, growth factors, cytokines, globin genes, immunosuppressive genes, tumor suppressors, oncogenes, complement-activating genes, milk proteins (casein, alpha-lactalbumin, beta-lactoglobulin, bovine and human serum albumin), immunoglobulins, and pharmaceutical proteins and vaccines.

In a preferred embodiment, the methods of the invention are used to generate pools or libraries of variant nucleic acid sequences, and cellular libraries containing the variant sequences. This idea is somewhat similar to the “gene shuffling” techniques of the literature (see Stemmer et al., 1994, Nature 370: 389), which attempt to rapidly “evolve” genes by making multiple random changes simultaneously. In the present invention, this end is accomplished by using at least one cycle, and preferably reiterative cycles, of enhanced homologous recombination with targeting polynucleotides containing random mismatches, substitutions, insertions, or deletions. By using a library of targeting polynucleotides comprising a plurality of random mutations, and repeating the homologous recombination steps as many times as needed, a rapid “gene evolution” can occur, wherein the new genes may contain large numbers of mutations.
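The way reiterative cycles accumulate mutations can be pictured with a toy simulation. This is only an illustrative sketch under strong simplifying assumptions: each library member carries a single substitution relative to the starting sequence, each cycle incorporates exactly one randomly chosen member, and no selection step or recombination efficiency is modeled. All names and the example sequence are hypothetical.

```python
import random

BASES = "ACGT"


def make_library(target, rng, size):
    """Build `size` variants, each described as (position, substituted base).

    The substituted base always differs from the original base at that position.
    """
    library = []
    for _ in range(size):
        pos = rng.randrange(len(target))
        new_base = rng.choice([b for b in BASES if b != target[pos]])
        library.append((pos, new_base))
    return library


def run_cycles(target, library, cycles, rng):
    """Apply one randomly chosen library member per cycle.

    Returns the final sequence and, for each cycle, the number of positions
    that differ from the starting sequence.
    """
    current = list(target)
    history = []
    for _ in range(cycles):
        pos, base = rng.choice(library)
        current[pos] = base
        history.append(sum(a != b for a, b in zip(current, target)))
    return "".join(current), history


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the run is repeatable
    start = "ATGGCTAGCAAGGGCGAGGAG"  # hypothetical target sequence
    pool = make_library(start, rng, size=50)
    evolved, counts = run_cycles(start, pool, cycles=10, rng=rng)
    print(start)
    print(evolved)
    print("mutations after each cycle:", counts)
```

In this toy model the mutation count can only stay level or rise from one cycle to the next, which mirrors the statement that repeated cycles can yield genes containing large numbers of mutations.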

Thus, in this embodiment, a plurality of targeting polynucleotides are used. The targeting polynucleotides each have at least one homology clamp that substantially corresponds to or is substantially complementary to the target sequence. Generally, the targeting polynucleotides are generated in pairs; that is, pairs of two single-stranded targeting polynucleotides that are substantially complementary to each other are made (i.e., a Watson strand and a Crick strand). However, as will be appreciated by those in the art, less than a one to one ratio of Watson to Crick strands may be used; for example, an excess of one of the single-stranded target polynucleotides (i.e., Watson) may be used. Preferably, sufficient numbers of each of Watson and Crick strands are used to allow the majority of the targeting polynucleotides to form double D-loops, which are preferred over single D-loops as outlined above. In addition, the pairs need not have perfect complementarity; for example, an excess of one of the single-stranded target polynucleotides (i.e., Watson), which may or may not contain mismatches, may be paired to a large number of variant Crick strands, etc. Due to the random nature of the pairing, one or both of any particular pair of single-stranded targeting polynucleotides may not contain any mismatches. However, generally, at least one of the strands will contain at least one mismatch.

The plurality of pairs preferably comprise a pool or library of mismatches. The size of the library will depend on a number of factors, including the number of residues to be mutagenized, the susceptibility of the protein to mutation, etc., as will be appreciated by those in the art. Generally, a library in this instance preferably comprises at least 10% different mismatches over the length of the targeting polynucleotides, with at least 30% mismatches being preferred and at least 40% being particularly preferred, although as will be appreciated by those in the art, lower (1, 2, 5%, etc.) or higher amounts of mismatches are both possible and desirable in some instances. That is, the plurality of pairs comprise a pool of random and preferably degenerate mismatches over some regions or all of the entire targeting sequence. As outlined herein, “mismatches” include substitutions, insertions and deletions, with the former being preferred. Thus, for example, a pool of degenerate variant targeting polynucleotides covering some, or preferably all, possible mismatches over some region is generated, as outlined above, using techniques well known in the art. Preferably, but not necessarily, the variant targeting polynucleotides each comprise only one or a few mismatches (less than 10), to allow complete multiple randomization. That is, by repeating the homologous recombination steps any number of times, as is more fully outlined below, the mismatches from a plurality of probes can be incorporated into a single target sequence.
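A pool covering all possible single-substitution mismatches over a chosen region can be enumerated directly, as the following sketch shows. It handles substitutions only (the preferred mismatch type named above), assumes an uppercase A/C/G/T sequence, and uses hypothetical names and a hypothetical example sequence.

```python
BASES = "ACGT"


def single_mismatch_pool(sequence, start=0, end=None):
    """Return every variant of `sequence` with exactly one substitution
    at a position in the half-open interval [start, end)."""
    end = len(sequence) if end is None else end
    pool = []
    for pos in range(start, end):
        for base in BASES:
            if base != sequence[pos]:
                pool.append(sequence[:pos] + base + sequence[pos + 1:])
    return pool


if __name__ == "__main__":
    probe = "ATGGCT"  # hypothetical targeting sequence
    variants = single_mismatch_pool(probe, start=1, end=4)
    print(len(variants), "variants over a 3-base region")
    for v in variants:
        print(v)
```

A region of n bases yields 3n single-substitution variants, so the pool grows linearly with the region length when each probe carries only one mismatch.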

The mismatches can be either non-random (i.e. targeted) or random, including biased randomness. That is, in some instances specific changes are desirable, and thus the sequence of the targeting polynucleotides is specifically chosen. In a preferred embodiment, the mismatches are random. The targeting polynucleotides can be chemically synthesized, and thus may incorporate any nucleotide at any position. The synthetic process can be designed to generate randomized nucleic acids, to allow the formation of all or most of the possible combinations over the length of the nucleic acid, thus forming a library of randomized targeting polynucleotides. Preferred methods maximize library size and diversity.

It is important to understand that in any library system encoded by oligonucleotide synthesis one cannot have complete control over the codons that will eventually be incorporated into the peptide structure. This is especially true in the case of codons encoding stop signals (TAA, TGA, TAG). In a synthesis with NNN as the random region, there is a 3/64, or 4.69%, chance that the codon will be a stop codon. To alleviate this, random residues are encoded as NNK, where K = T or G. This allows for encoding of all potential amino acids (changing their relative representation slightly), while importantly preventing the encoding of two of the stop codons, TAA and TGA.
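The codon arithmetic above can be checked with a short Python sketch that enumerates the codons matching a degenerate scheme against the standard genetic code. The helper names `expand` and `summarize` are illustrative assumptions, and only the IUPAC symbols needed here (N and K) are included.

```python
from itertools import product

# Standard genetic code, listed in TCAG order for each codon position.
BASES = "TCAG"
AMINO_ACIDS = ("FFLLSSSSYY**CC*W" "LLLLPPPPHHQQRRRR"
               "IIIMTTTTNNKKSSRR" "VVVVAAAADDEEGGGG")
CODON_TABLE = {"".join(codon): aa
               for codon, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}

# Subset of IUPAC degenerate symbols used in this example.
IUPAC = {"A": "A", "C": "C", "G": "G", "T": "T", "N": "ACGT", "K": "GT"}

def expand(scheme):
    """List every codon matching a 3-letter degenerate scheme such as 'NNK'."""
    return ["".join(c) for c in product(*(IUPAC[s] for s in scheme))]

def summarize(scheme):
    """Return (codon count, stop codons, number of distinct amino acids)."""
    codons = expand(scheme)
    stops = [c for c in codons if CODON_TABLE[c] == "*"]
    amino_acids = {CODON_TABLE[c] for c in codons} - {"*"}
    return len(codons), stops, len(amino_acids)

for scheme in ("NNN", "NNK"):
    total, stops, n_aa = summarize(scheme)
    print(scheme, total, stops, n_aa)
```

Under the standard code, NNN yields 3 stop codons out of 64, while NNK yields 32 codons that still cover all 20 amino acids. TAG still matches NNK, so the stop frequency is reduced to 1 in 32 rather than eliminated.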

3.5.3.13.2.1. Mismatches are Fully Randomized, with No Sequence Preferences or Constants at any Position

In one embodiment, the mismatches are fully randomized, with no sequence preferences or constants at any position.

3.5.3.13.2.2. Biased Library

In a preferred embodiment, the library is biased. That is, some positions within the sequence are either held constant, or are selected from a limited number of possibilities. For example, in a preferred embodiment, the nucleotides or amino acid residues are randomized within a defined class, for example, of hydrophobic amino acids, hydrophilic residues, sterically biased (either small or large) residues, towards the creation of cysteines for cross-linking, prolines for SH-3 domains, serines, threonines, tyrosines or histidines for phosphorylation sites, etc., or to purines, etc.
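A minimal Python sketch of a biased library at the amino acid level is given below, assuming each position is either held constant or drawn from a named residue class. The class memberships, the four-position specification, and the function name `biased_library` are illustrative assumptions and are not taken from the text.

```python
from itertools import product

# Hypothetical residue classes (one-letter amino acid codes).
CLASSES = {
    "hydrophobic": "AVILMFW",  # assumed grouping, for illustration only
    "phospho": "STYH",         # Ser, Thr, Tyr or His, as named for phosphorylation sites
}

def biased_library(spec):
    """Enumerate all sequences allowed by spec.

    Each entry of spec is either a class name from CLASSES (the position is
    randomized within that class) or a single residue (held constant).
    """
    pools = [CLASSES.get(item, item) for item in spec]
    return ["".join(residues) for residues in product(*pools)]

# Positions 1 and 4 are constant; positions 2 and 3 are randomized within a class.
spec = ["G", "hydrophobic", "phospho", "P"]
library = biased_library(spec)
print(len(library), library[:3])
```

With these assumed classes the two variable positions give 7 × 4 = 28 members, compared with 20 × 20 = 400 if both positions were fully randomized, so the bias concentrates the library on the class of interest.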

As will be appreciated by those in the art, the introduction of a pool of variant targeting polynucleotides (in combination with recombinase) to a target sequence, in vitro to an extrachromosomal sequence, can result in a large number of homologous recombination reactions occurring over time. That is, any number of homologous recombination reactions can occur on a single target sequence, to generate a wide variety of single and multiple mismatches within a single target sequence, and a library of such variant target sequences, most of which will contain mismatches and be different from other members of the library. This thus works to generate a library of mismatches.

3.5.3.13.2.2.1. Generating a Large Number of Different Variants within a Particular Region of a Sequence, Similar to Cassette Mutagenesis but not Limited by Sequence Length

In a preferred embodiment, the variant targeting polynucleotides are made to a particular region or domain of a sequence (i.e. a nucleotide sequence that encodes a particular protein domain). For example, it may be desirable to generate a library of all possible variants of a binding domain of a protein, without affecting a different biologically functional domain, etc. Thus, the methods of the present invention find particular use in generating a large number of different variants within a particular region of a sequence, similar to cassette mutagenesis but not limited by sequence length. This idea is sometimes referred to herein as “domain specific gene evolution”. In addition, two or more regions may also be altered simultaneously using these techniques; thus “single domain” and “multi-domain” shuffling can be done. Suitable domains include, but are not limited to, kinase domains, nucleotide-binding sites, DNA binding sites, signaling domains, receptor binding domains, transcriptional activating regions, promoters, origins, leader sequences, terminators, localization signal domains, and, in immunoglobulin genes, the complementarity determining regions (CDR), Fc, V_(H), and V_(L).

In a preferred embodiment, the variant targeting polynucleotides are made to the entire target sequence. In this way, a large number of single and multiple mismatches may be made in an entire sequence.

Thus, this embodiment proceeds as follows. A pool of targeting polynucleotides is made, each containing one or more mismatches. The probes are coated with recombinase as generally described herein, and introduced to the target sequence. Upon binding of the probes to form D-loops, the recombinase is preferably removed. These polynucleotide:target sequences can then be introduced into recombination-proficient cells, to produce target protein which can then be tested for biological activity, based on the identification of the target sequence. Depending on the results, the altered target sequence can be used as the starting target sequence in reiterative rounds of homologous recombination, generally using the same library. Preferred embodiments utilize at least two rounds of homologous recombination, with at least 5 rounds being preferred and at least 10 rounds being particularly preferred. Again, the number of reiterative rounds that are performed will depend on the desired end-point, the resistance or susceptibility of the protein to mutation, the number of mismatches in each probe, etc.

3.5.3.14. Target Sequence

3.5.3.14.1. Target Sequences—an Immunoglobulin

In a preferred embodiment, the target sequence is an immunoglobulin. The amino terminal regions of the light and heavy chains of an antibody come together to form the antigen binding site, and the variability of their amino acid sequences provides the structural basis for the diversity of antigen binding sites. The variability of the variable regions of both the heavy and light chains is for the most part restricted to three small hypervariable regions in each chain. The remaining part of the variable regions, known as framework regions, is relatively constant. Each of the hypervariable regions consists of only about 5 to 10 amino acids; the corresponding regions in the DNA encoding these regions are known as the complementarity determining regions, or CDRs. Thus to engineer an antibody library, for example an antibody phage library, one can change the sequences in the CDR regions of both the heavy and light chains. Different permutations and combinations of CDRs can be changed and evolved to engineer antibody-phage libraries.

3.5.3.14.2. Target Sequence—a Single-Chain Fv Framework for any Number of Specific Antigens

In a preferred embodiment, the target sequence is a single-chain Fv framework for any number of specific antigens. Single chain Fv (scFv) consists of V_(L) and V_(H) domains of an immunoglobulin linked by a peptide spacer and thus contains the minimal antigen-binding domains of an antibody.

3.5.3.14.3. Target Sequence—an Antibody-Phage Fusion

In a preferred embodiment, antibody-phage fusions are used as the target sequence. As is known in the art, single-chain Fv fusions with the pIII minor coat protein allow expression of the antibody on the surface of a phage, wherein it is available to bind antigen. Five copies of pIII are expressed on the surface of the phage. It is therefore possible to express five scFv on the phage. This antibody-phage display system has been used previously to isolate novel antibodies. By starting with antibodies to any antigen, higher affinity antibodies may be made, as well as novel antibodies.

3.5.3.14.4. Target Sequence—the Coding Sequence for beta-Lactamase

In a preferred embodiment, the target sequence is the coding sequence for beta-lactamase.

Thus, the methods of the invention may be used to create superior recombinant reporter genes such as lacZ and green fluorescent protein (GFP); superior antibiotic and drug resistance genes; superior recombinase genes; superior recombinant vectors; and other superior recombinant genes and proteins, including immunoglobulins, vaccines or other proteins with therapeutic value. For example, targeting polynucleotides containing any number of alterations may be made to one or more functional or structural domains of a protein, and then the products of homologous recombination evaluated.

Once made and administered to target cells, the target cells may be screened to identify a cell that contains the targeted sequence modification. This will be done in any number of ways, and will depend on the target gene and targeting polynucleotides as will be appreciated by those in the art. The screen may be based on phenotypic, biochemical, genotypic, or other functional changes, depending on the target sequence. In an additional embodiment, as will be appreciated by those in the art, selectable markers or marker sequences may be included in the targeting polynucleotides to facilitate later identification.

3.5.3.15. Kits Containing the Compositions of the Invention are Provided

In a preferred embodiment, kits containing the compositions of the invention are provided. The kits include the compositions, particularly those of libraries or pools of degenerate cssDNA probes, along with any number of reagents or buffers, including recombinases, buffers, ATP, etc.

3.5.3.16. Targeting Polynucleotide: Target Nucleic Acid Complexes Serve as Substrates for Single-Stranded Endonucleases

In an alternate embodiment, the targeting polynucleotide: target nucleic acid complexes serve as substrates for single-stranded endonucleases, such as S1 and mung bean nuclease. Preferably the targeting polynucleotides are substantially complementary and form double D-loops with the target nucleic acid. The junctions of the complexes are single-stranded in nature, and thus are susceptible to single-strand specific nucleases and junction-specific nucleases. Accordingly, treatment of the complex with a single-strand nuclease results in defined nicks in the selected region encoding a predetermined domain of a protein encoded by the target nucleic acid. The nicked target nucleic acid is dissociated from the targeting polynucleotides and is reassembled and “shuffled” in vitro by PCR (Stemmer, 1994, Nature 370: 389-391) to produce a plurality of modified nucleic acids. The modified nucleic acids are introduced into an appropriate host cell, as described above, for expression of the plurality of modified proteins. Selection techniques are used as described herein to identify and isolate a cell expressing a modified protein. The process is repeated iteratively as needed to further evolve the targeted nucleic acid.

3.5.3.17. Isolation of New Members of Gene Families that Comprise Particular Domains

In a preferred embodiment, the present invention finds use in the isolation of new members of gene families that comprise particular domains. The use of domain filaments (i.e. domain homology clamps preferably containing a purification tag such as biotin or digoxigenin, or used with a purification method such as a RecA antibody) allows the identification of genes containing the domain. Once identified, the new genes can be cloned, sequenced and the protein gene products purified. As will be appreciated by those in the art, the functional importance of the new genes can be assessed in a number of ways, including functional studies on the protein level, as well as the generation of “knock out” animal models. By choosing domain sequences for therapeutically relevant protein domains, novel targets can be identified that can be used in screening of drug candidates.

3.5.3.18. Utilizing the Purification Tag to Isolate the Gene(s)

Thus, in a preferred embodiment, the present invention provides methods for isolating new members of gene families containing protein domains comprising introducing targeting polynucleotides comprising domain homology clamps and at least one purification tag, preferably biotin, to a mix of nucleic acid, such as a plasmid cDNA library or a cell, and then utilizing the purification tag to isolate the gene(s). The exact methods will depend on the purification tag; a preferred method utilizes the attachment of the binding ligand for the tag to a bead, which is then used to pull out the sequence. Alternatively, anti-RecA antibodies could be used to capture RecA-coated probes. The genes are then cloned, sequenced, and reassembled if necessary, as is well known in the art.

3.5.3.19. Use in Functional Genomic Studies, by Providing the Creation of Transgenic Animal Models of Disease

In an alternate preferred embodiment, the present invention finds use in functional genomic studies, by providing the creation of transgenic animal models of disease. Thus, for example, domain sequences used in homologous recombination methods can generate animals that have a wide variety of mutations in a wide variety of related domains of genes, potentially resulting in a wide variety of phenotypes, including phenotypes related to disease states. That is, by targeting a domain family, one, two or multiple genes in the family may be altered in any given experiment, thus creating a wide variety of genotypes and phenotypes to evaluate. Thus, in a preferred embodiment, the compositions and methods of the invention are used to generate pools or libraries of variant nucleic acid sequences, wherein the mutations are within the functional domain coding region, cellular libraries containing the variant libraries, and libraries of animals containing the variant libraries.

Furthermore, domain targeting can be used in cells or animals that are diseased or altered; in essence, domain targeting can be done to identify “reversion” genes, genes that can modulate disease states caused by domains of different genes. Thus, for example, the loss of one type of enzymatic activity, resulting in a disease phenotype, may be compensated by alterations in a different but homologous enzymatic activity.

Accordingly, once the recombinase-targeting polynucleotide compositions are formulated, they are introduced or administered into target cells. The administration is typically done as is known for the administration of nucleic acids into cells, and, as those skilled in the art will appreciate, the methods may depend on the choice of the target cell. Suitable methods include, but are not limited to, microinjection, electroporation, lipofection, etc.

By “target cells” herein is meant prokaryotic or eukaryotic cells. Suitable prokaryotic cells include, but are not limited to, bacteria such as E. coli, Bacillus species, and the extremophile bacteria such as thermophiles, halophiles, etc. Preferably, the procaryotic target cells are recombination competent. Suitable eukaryotic cells include, but are not limited to, fungi such as yeast and filamentous fungi, including species of Aspergillus, Trichoderma, and Neurospora; plant cells including those of corn, sorghum, tobacco, canola, soybean, cotton, tomato, potato, alfalfa, sunflower, etc.; and animal cells, including fish, reptiles, amphibia, birds and mammals. Suitable fish cells include, but are not limited to, those from species of salmon, trout, tilapia, tuna, carp, flounder, halibut, swordfish, cod and zebrafish. Suitable bird cells include, but are not limited to, those of chickens, ducks, quail, pheasants, ostrich, and turkeys, and other jungle fowl or game birds.

Suitable mammalian cells include, but are not limited to, cells from horses, cows, buffalo, deer, sheep, rabbits, rodents such as mice, rats, hamsters and guinea pigs, goats, pigs, primates, marine mammals including dolphins and whales, as well as cell lines, such as human cell lines of any tissue or stem cell type, and stem cells, including pluripotent and non-pluripotent, and non-human zygotes. Particular human cells include, but are not limited to, tumor cells of all types (particularly melanoma, myeloid leukemia, carcinomas of the lung, breast, ovaries, colon, kidney, prostate, pancreas and testes), cardiomyocytes, endothelial cells, epithelial cells, lymphocytes (T-cell and B cell), mast cells, eosinophils, vascular intimal cells, hepatocytes, leukocytes including mononuclear leukocytes, stem cells such as haemopoietic, neural, skin, lung, kidney, liver and myocyte stem cells, osteoclasts, chondrocytes and other connective tissue cells, keratinocytes, melanocytes, liver cells, kidney cells, and adipocytes. Suitable cells also include known research cells, including, but not limited to, Jurkat T cells, mouse La, HT1080, Cl 27, Rat2, CV-1, NIH3T3 cells, CHO, COS, 293 cells, etc. See the ATCC cell line catalog, hereby expressly incorporated by reference.

3.5.3.20. Procaryotic Cells are Used to Identify, Clone, or Alter Target Sequences

In a preferred embodiment, procaryotic cells are used to identify, clone, or alter target sequences, preferably protein domains. In this embodiment, a pre-selected target DNA sequence is chosen for alteration. Preferably, the pre-selected target DNA sequence is contained within an extrachromosomal sequence. By “extrachromosomal sequence” herein is meant a sequence separate from the chromosomal or genomic sequences. Preferred extrachromosomal sequences include plasmids (particularly procaryotic plasmids such as bacterial plasmids), P1 vectors, viral genomes, yeast, bacterial and mammalian artificial chromosomes (YAC, BAC and MAC, respectively), and other autonomously self-replicating sequences, although this is not required. As described herein, a recombinase and at least two single-stranded targeting polynucleotides which are substantially complementary to each other, each of which contains a homology clamp to the target sequence contained on the extrachromosomal sequence, are added to the extrachromosomal sequence, preferably in vitro. The two single-stranded targeting polynucleotides are preferably coated with recombinase, and at least one of the targeting polynucleotides contains at least one nucleotide substitution, insertion or deletion. The targeting polynucleotides then bind to the target sequence in the extrachromosomal sequence to effect homologous recombination and form an altered extrachromosomal sequence which contains the substitution, insertion or deletion. The altered extrachromosomal sequence is then introduced into the procaryotic cell using techniques known in the art. Preferably, the recombinase is removed prior to introduction into the target cell, using techniques known in the art. For example, the reaction may be treated with proteases such as proteinase K, detergents such as SDS, and phenol extraction (including phenol:chloroform:isoamyl alcohol extraction). These methods may also be used for eukaryotic cells. The cells are then grown under conditions which allow the expression of the variant nucleic acids to form variant proteins, particularly with alterations in domains.

3.5.3.20.1. Proteins Having the Desired Phenotype are Selected and Isolated

In a preferred embodiment, proteins having the desired phenotype are selected and isolated, the modified nucleic acid is sequenced to identify sequences effecting the desired activity, and the process is repeated iteratively as needed to produce a protein having a desired activity or property. Thus, in a preferred embodiment, the methods of the invention are repeated until the desired protein or phenotype is seen.

Alternatively, the pre-selected target DNA sequence is a chromosomal sequence. In this embodiment, the recombinase with the targeting polynucleotides is introduced into the target cell, preferably a eukaryotic target cell. In this embodiment, it may be desirable to bind (generally non-covalently) a nuclear localization signal to the targeting polynucleotides to facilitate localization of the complexes in the nucleus. See for example Kido et al., Exper. Cell Res. 198: 107-114 (1992), hereby expressly incorporated by reference. The targeting polynucleotides and the recombinase function to effect homologous recombination, resulting in altered chromosomal or genomic sequences.

3.5.3.21. Eukaryotic Cells are Used

In a preferred embodiment, eukaryotic cells are used. Basically, any mammalian cells may be used, with mouse, rat, primate and human cells being particularly preferred. Accordingly, suitable cell types include, but are not limited to, tumor cells of all types, i.e., fibroblasts, epithelial cells (particularly melanoma, myeloid leukemia, carcinomas of the lung, breast, ovaries, colon, kidney, prostate, pancreas and testes), cardiomyocytes, endothelial cells, epithelial cells, lymphocytes (T-cell and B cell), mast cells, eosinophils, vascular intimal cells, hepatocytes, leukocytes including mononuclear leukocytes, stem cells such as haemopoietic, neural, skin, lung, kidney, liver and myocyte stem cells (for use in screening for differentiation and de-differentiation factors), osteoclasts, chondrocytes and other connective tissue cells, keratinocytes, melanocytes, liver cells, kidney cells, and adipocytes. Suitable cells also include known research cells, including, but not limited to, Jurkat T cells, NIH 3T3 cells, CHO, Cos, etc. See the ATCC cell line catalog, hereby expressly incorporated by reference.

For making transgenic non-human animals (which include homologously targeted non-human animals), embryonal stem cells (ES cells), donor cells for nuclear transfer and fertilized zygotes are preferred. In a preferred embodiment, embryonal stem cells are used. Murine ES cells, such as the AB-1 line grown on mitotically inactive SNL76/7 cell feeder layers (McMahon and Bradley, Cell 62: 1073-1085 (1990)) essentially as described (Robertson, E. J. (1987) in Teratocarcinomas and Embryonic Stem Cells: A Practical Approach. E. J. Robertson, ed. (Oxford: IRL Press), p. 71-112; Zijlstra et al., Nature 342: 435-438 (1989); and Schwartzberg et al., Science 246: 799-803 (1989), each of which is incorporated herein by reference) may be used for homologous gene targeting. Other suitable ES lines include, but are not limited to, the E14 line (Hooper et al. (1987) Nature 326: 292-295), the D3 line (Doetschman et al. (1985) J. Embryol. Exp. Morph. 87: 21-45), and the CCE line (Robertson et al. (1986) Nature 323: 445-448). The success of generating a mouse line from ES cells bearing a specific targeted mutation depends on the pluripotence of the ES cells (i.e., their ability, once injected into a host blastocyst, to participate in embryogenesis and contribute to the germ cells of the resulting animal).

The pluripotence of any given ES cell line can vary with time in culture and the care with which it has been handled. The only definitive assay for pluripotence is to determine whether the specific population of ES cells to be used for targeting can give rise to chimeras capable of germline transmission of the ES genome. For this reason, prior to gene targeting, a portion of the parental population of AB-1 cells is injected into C57BL/6J blastocysts to ascertain whether the cells are capable of generating chimeric mice with extensive ES cell contribution and whether the majority of these chimeras can transmit the ES genome to progeny.

3.5.3.22. Non-Human Zygotes are Used

In a preferred embodiment, non-human zygotes are used, for example to make transgenic animals, using techniques known in the art (see U.S. Pat. No. 4,873,191; Brinster et al., PNAS 86: 7007 (1989); Susulic et al., J. Biol. Chem. 49: 29483 (1995), and Cavard et al., Nucleic Acids Res. 16: 2099 (1988), hereby incorporated by reference). Preferred zygotes include, but are not limited to, animal zygotes, including fish, avian, reptilian, amphibian and mammalian zygotes. Suitable fish zygotes include, but are not limited to, those from species of salmon, trout, tuna, carp, flounder, halibut, swordfish, cod, tilapia and zebrafish. Suitable bird zygotes include, but are not limited to, those of chickens, ducks, quail, pheasant, turkeys, and other jungle fowl and game birds. Suitable mammalian zygotes include, but are not limited to, cells from horses, cows, buffalo, deer, sheep, rabbits, rodents such as mice, rats, hamsters and guinea pigs, goats, pigs, primates, and marine mammals including dolphins and whales. See Hogan et al., Manipulating the Mouse Embryo (A Laboratory Manual), 2nd Ed. Cold Spring Harbor Press, 1994, incorporated by reference.

The vectors containing the DNA segments of interest can be transferred into the host cell by well-known methods, depending on the type of cellular host. For example, micro-injection is commonly utilized for target cells, although calcium phosphate treatment, electroporation, lipofection, biolistics or viral-based transfection also may be used. Other methods used to transform mammalian cells include the use of Polybrene, protoplast fusion, and others (see, generally, Sambrook et al. Molecular Cloning: A Laboratory Manual, 2d ed., 1989, Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y., which is incorporated herein by reference). Direct injection of DNA and/or recombinase-coated targeting polynucleotides into target cells, such as skeletal or muscle cells, also may be used (Wolff et al. (1990) Science 247: 1465, which is incorporated herein by reference).

3.5.3.23. Precursor Animals or Cells Already Contain a Disease Allele

In a preferred embodiment, the precursor animals or cells already contain a disease allele. As used herein, the term “disease allele” refers to an allele of a gene which is capable of producing a recognizable disease. A disease allele may be dominant or recessive and may produce disease directly or when present in combination with a specific genetic background or preexisting pathological condition. A disease allele may be present in the gene pool or may be generated de novo in an individual by somatic mutation. For example and not limitation, disease alleles include: activated oncogenes, a sickle cell anemia allele, a Tay-Sachs allele, a cystic fibrosis allele, a Lesch-Nyhan allele, a retinoblastoma-susceptibility allele, a Fabry's disease allele, a Huntington's chorea allele, and a xeroderma pigmentosum allele. As used herein, a disease allele encompasses both alleles associated with human diseases and alleles associated with recognized veterinary diseases. For example, the deltaF508 CFTR allele is a human disease allele which is associated with cystic fibrosis in North Americans.

Once made and administered to target cells, new domains of genes may be isolated as outlined herein.

Alternatively, the target cells may be screened to identify a cell that contains the targeted functional domain sequence modification. This will be done in any number of ways, and will depend on the target domain and targeting polynucleotides as will be appreciated by those in the art. The screen may be based on phenotypic, biochemical, genotypic, or other functional changes, depending on the target sequence. For example, IgE levels may be evaluated for inflammation or asthma; vascular tone or blood pressure can be evaluated for hypertension; behavior screens can be done for neurologic effects; lipoprotein profiles can be screened for cardiovascular effects; secreted molecules can be evaluated for endocrine processes; CBCs can be done for hematology studies, etc. In an additional embodiment, as will be appreciated by those in the art, selectable markers or marker sequences may be included in the targeting polynucleotides to facilitate later identification.

The broad scope of this invention is best understood with reference to the following examples, which are not intended to limit the invention in any manner. All patents, patent applications, and publications cited herein are expressly incorporated by reference in their entirety.

3.5.4. Gene Deletion in Bacteria

This invention relates to a method and means for deleting a gene from a bacterial chromosome in a single step.

3.5.4.1 Applications

3.5.4.1.1 The Construction of Special Bacterial Strains which have a Particular Genetic Background

Many applications require the construction of special bacterial strains which have a particular genetic background. These genetic backgrounds are the framework in which specific recombinant DNA plasmid constructions are tested to determine whether they can provide functions which are missing from the background of the bacteria. If such functions are provided by the recombinant plasmid, then there is positive evidence that a particular genetic locus or loci is encoded by the plasmid. The construction of genetic backgrounds is therefore a vital step in the subsequent cloning and investigation of specific genes.

3.5.4.1.2. The Gene in Question has been Deleted, so there is No Production whatsoever of a Mutant Protein Making Analysis Much Less Ambiguous

The most common backgrounds are those in which a single mutation is present in a specific gene on the bacterial chromosome. This may result in synthesis of a defective version of the protein encoded by that gene, resulting in a specific cellular dysfunction. Correction of the cellular dysfunction, by introduction of a specific recombinant plasmid, is evidence that the relevant gene has been cloned onto the specific plasmid. However, a mutant protein may not be silent and may undergo interactions with other components, thereby creating the appearance that the plasmid gene encodes the entire active protein when it does not. This can seriously confuse the analysis. A much less ambiguous and therefore more desirable approach is one in which the gene in question has been deleted. In these circumstances, there is no production whatsoever of a mutant protein.

3.5.4.1.3. If Bacteria are Being Used to Genetically Engineer a Protein, Deletion of the Gene that Manufactures a Contaminating Protein Made by the Same Bacteria can Reduce Purification Steps

Another situation where it is desirable to delete a complete gene from the chromosome of a bacterium is when the bacterium is being used in the production of a genetically engineered protein. Examples of these situations include the expression of insulin, growth hormone, protein A, and various vaccines from recombinant genes inserted into E. coli. Many times the E. coli produces proteins which contaminate the purified product produced by the genetic engineering. Although it is possible to add additional purification steps to remove such contaminants, it would be preferable to avoid the problem entirely by deleting the gene encoding the contaminating protein. Methods that are presently used to alter a gene include random mutation or inactivation of the gene sequence by mutation, insertion, or deletion of some portion of the gene. However, this can still lead to the production of inactive protein fragments or deletion of more of the chromosome than is necessary or desirable.

It is therefore an object of the present invention to provide a method and means for deleting a specific gene from a bacterium.

It is another object of the present invention to provide a method and means for inserting and/or inactivating or deleting the recA gene in a variety of bacteria.

3.5.4.2. Miscellaneous Applications

3.5.4.2.1 Creating a Deletion in a Defined Target within the Bacterial Chromosome

There are two obstacles which have to be overcome. One is to develop a method which will create a deletion in a defined target within the bacterial chromosome.

3.5.4.2.2 Generalizing the Approach so that Even Essential Genes can be Deleted

The second is to generalize the approach so that even essential genes can be deleted. This is because many of the genes of interest are essential ones. The deletion of an essential gene normally would result in cell death.

Essential proteins include enzymes of glycolysis, enzymes associated with amino acid or sugar biosynthesis, enzymes and factors associated with protein and nucleic acid biosynthesis (including both RNA and DNA), enzymes required for the synthesis of cofactors for oxidation, reduction, methylation and transamination processes, and enzymes necessary for synthesis of essential lipids and polysaccharides or of any other essential molecule, including various nucleic acids, such as transfer or ribosomal RNAs, and segments of nucleic acids, such as gene regulatory elements.

It is therefore an object of the present invention to provide a methodwherein an organism is produced which does not contain genetic materialcoding for the molecule which is to be cloned and expressed in theorganism.

A further object of the present invention is to provide a method wherebya deficient organism which is to be used for cloning an essential generemains viable even under restrictive conditions.

3.5.4.3. Specialized Applications

3.5.4.3.1. Method for the Deletion of a Gene from a Bacterium Using a Single Step Procedure that is Applicable to any Gene that has been Cloned

Disclosed is a method for the deletion of a gene from a bacterium using a single step procedure that is applicable to any gene that has been cloned. The procedure depends upon site-directed recombination of linear DNA fragments with sequences on the chromosome as a function of recA in combination with the subsequent inactivation or deletion of the recA gene. The method is analogous to a procedure used to give insertions by homologous recombination into specific plasmid genes.

3.5.4.3.1.1. Strategy for Construction of Chromosomal Deletions

The basic strategy for construction of chromosomal deletions is to transform the bacteria with linear DNA fragments which contain an antibiotic resistance or other phenotypically detectable gene segment (a “marker”) flanked by sequences homologous to a closely spaced region on the cell chromosome containing the gene to be deleted. A double-crossover event within the homologous sequences, effectively deleting the entire gene, is selected for by screening for the antibiotic resistant phenotype.

The linear fragment is not integrated into the chromosome in the absence of enzymes expressed by the recA gene. Accordingly, if the gene is absent or inactive, a recA gene must be inserted into the cell prior to the linear recombination event, and then inactivated or removed to prevent subsequent incorporation of other non-chromosomal sequences into the chromosome. This is particularly important if the bacterium is used as a host for the expression of genetically engineered proteins from sequences carried on plasmids or other extrachromosomal elements. The recA gene can be provided either in the form of an extrachromosomal element such as a plasmid or through incorporation of the gene into the chromosome. The recA gene is preferably inactivated or deleted by means of a double reciprocal recombination event utilizing linear sequences containing sequences homologous to the flanking sequences on either side of the recA gene in the chromosome. This is essentially the same method used to delete or insert a gene into the chromosome by homologous recombination as described above.

The present invention includes isolated linear DNA fragments constructed for use in the method for deleting a gene from the chromosome and for inserting or deleting the recA gene.

The method and sequences are applicable to a variety of bacteria including strains of Escherichia, Pseudomonas, Agrobacterium, Proteus, Erwinia, Shigella, Bacillus, Rhizobium, Vibrio, Salmonella, Streptococcus, and Haemophilus. A plasmid which has a temperature-sensitive replicon and a wild-type allele of the desired gene is used to restore or maintain the phenotype produced by the deleted gene. This plasmid maintains production of the desired protein, and therefore cell viability if the encoded protein is essential to cell growth, when the chromosomal copy of the desired gene has been deleted. However, since the resulting cells have a temperature-sensitive phenotype, the expression of the plasmid gene may be easily prevented by culturing the host strain at an elevated temperature. The resulting deficient host strain may then be used to screen other mutated and cloned genes for their ability to produce the desired protein.

3.5.5. Additional Considerations

3.5.5.1. Maintaining the Viability of the Bacteria

The present invention is a method for deleting any gene from a bacterial strain, while maintaining the viability of the bacteria if the gene encodes an essential molecule and deletion of the essential gene results in a lethal phenotype. The gene to be deleted is provided on a plasmid with a temperature sensitive replicon. The cells now have a temperature sensitive phenotype. When the cells are grown at an elevated temperature under conditions which allow rapid detection of the absence of the desired molecule, complementation of this phenotype by introduction of a DNA fragment fused onto a stable plasmid is strong evidence for cloning of the gene which has been deleted from the chromosome.

3.5.5.2. Homologous Recombination

The present invention is a method for deleting any gene from a bacterial strain employing linear DNA fragments incorporating sequences homologous to the sequences flanking the gene to be deleted in the chromosome and sequences allowing insertion or removal/inactivation of recA in a variety of bacteria.

3.5.5.2.1. Roles for Several Enzymes Needed in Recombination

Homologous recombination has been detected in a wide variety of organisms, from simple bacteriophages to complex eukaryotic cells. Genetic and biochemical investigations have defined roles for several enzymes needed in recombination.

3.5.5.2.2. RecA

3.5.5.2.2.1. Alignment of DNA Molecules before Exchange

The RecA protein participates in the early steps of synapsis, allowing alignment of DNA molecules before exchange; in strand transfer, where there is transfer of a single-stranded segment to a recipient duplex to form a limited heteroduplex region between the interacting DNAs; and in the extension of this heteroduplex region by a reaction involving the concerted winding and unwinding of incoming and outgoing DNA chains, respectively. The hydrolysis of ATP by RecA protein is required for these events in vitro.

3.5.5.2.2.2. Controlling Expression of a Group of Unlinked Genes that Aid in Recovery of Cells after Exposure to DNA Damaging Agents

The recA gene performs another equally important role in cell metabolism by controlling expression of a group of unlinked genes that aid in recovery of cells after exposure to DNA-damaging agents. This response, termed the SOS response, involves genes that participate in repair of DNA damage, mutagenesis, and coordination of cell division events.

3.5.5.2.2.3. Characterization

The purified RecA protein is a single polypeptide ranging in molecular weight from about 37,000 to about 42,000. Although there is some variation in the sequences between bacterial strains, the RecA proteins from a variety of bacteria in general have been isolated and characterized by interspecies complementation and assays utilizing comparisons with isolated, characterized proteins.

The present method and the linear fragments for use in the present method are not limited to organisms such as E. coli. The recA gene is found in a variety of bacteria including both gram negative and gram positive organisms. The recA gene has been isolated or identified and characterized in several organisms. It is possible to prepare a genomic library from any bacterial species and to isolate a clone containing a sequence homologous to a characterized recA gene by interspecific complementation using one of the available DNA clones containing a recA sequence or cross-reactivity with antisera to RecA proteins from a well-characterized organism such as E. coli.

RecA+ and RecA− strains are available from a variety of sources. For example, cloning and characterization of recA genes and RecA proteins from Proteus vulgaris, Erwinia carotovora, Shigella flexneri and Escherichia coli are described by S. L. Keener, et al., in J. Bacteriol. 160(1), 153-160 (1984). The RecA proteins produced by these organisms were demonstrated to be highly conserved among the species. In fact, the protein produced by one species could be introduced into another species where it complemented repair and regulatory defects of recA mutations. Other bacterial recA genes and gene products have been described by C. A. Miles, et al., in Mol. Gen. Genet. 204, 161-165 (1986) (Agrobacterium tumefaciens C58), I. Goldberg et al., J. Bacteriol. 165(3), 715-722 (1986) (Vibrio cholerae), M. Better et al., J. Bacteriol. 155(1), 311-316 (1983) (Rhizobium meliloti), T. A. Kokjohn et al., J. Bacteriol. 163(2), 568-572 (1985) (Pseudomonas aeruginosa), and C. M. Lovett, Jr. et al., J. Biol. Chem. 260(6), 3305-3313 (1985) (Bacillus subtilis). These articles detail the isolation and characterization of gene libraries and the proteins encoded by the recA genes using techniques known to those skilled in the art including construction of gene libraries, identification of homologous genes using hybridization to probes from other more well characterized species such as E. coli, isolation and characterization of RecA proteins using antisera to RecA proteins from E. coli, and interspecies complementation of deficient strains of E. coli using gene segments from the libraries. The isolated proteins were useful for in vitro complementation studies. RecA deficient strains and RecA clones are available from many of the laboratories cited in the above articles and from the E. coli Genetic Stock Center at Yale University run by Dr. Barbara Bachmann.

3.5.5.2.3. Construction of Linear DNA Fragments which have Sequences Homologous to Closely Spaced Regions on the Chromosome in the Bacteria which Flank Each Side of the Gene of Interest

The method whereby the DNA fragment is introduced into the chromosome to delete a gene or to regulate recA involves the construction of linear DNA fragments which have sequences homologous to closely spaced regions on the chromosome in the bacteria which flank each side of the gene of interest. In general, they must contain at least 80 to 100 nucleotides homologous to sequences flanking the gene to be deleted or the recA gene. An antibiotic resistance locus such as Kan^(r), which encodes a protein making the bacterium resistant to the antibiotic, or other marker, is placed between these flanking sequences.
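The layout of such a linear fragment (flank, marker, flank) can be illustrated with a minimal Python sketch. The function name, the 80-nucleotide default, and the placeholder sequences are assumptions made for illustration only; the text specifies only that each flank should carry at least 80 to 100 nucleotides of homology and that the marker sits between the flanks.

```python
# Hypothetical helper: assembles a linear deletion fragment as flank + marker + flank.
# MIN_HOMOLOGY_NT uses the lower bound given in the text ("at least 80 to 100 nucleotides").
MIN_HOMOLOGY_NT = 80


def build_deletion_cassette(upstream_flank: str, marker: str, downstream_flank: str,
                            min_homology: int = MIN_HOMOLOGY_NT) -> str:
    """Return the linear fragment sequence, rejecting flanks that are too short."""
    for name, flank in (("upstream", upstream_flank), ("downstream", downstream_flank)):
        if len(flank) < min_homology:
            raise ValueError(
                f"{name} flank has {len(flank)} nt of homology; need at least {min_homology}"
            )
    # The marker (for example a kanamycin resistance gene) is placed between the flanks.
    return upstream_flank + marker + downstream_flank


if __name__ == "__main__":
    # Placeholder sequences, not real genomic or marker sequences.
    fragment = build_deletion_cassette("A" * 100, "GGG", "C" * 90)
    print(len(fragment))
```

A stricter threshold (for example 100) can be passed as `min_homology` when the upper end of the stated range is preferred.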

The linear DNA segment is introduced into a specific cell strain and selection is done for a double, reciprocal recombination which will delete the target gene and insert the marker into its place. As a result of the insertion of the antibiotic resistance gene, the cells are now resistant to the antibiotic. Cells which do not contain the insertion are eliminated by growing the bacteria in a medium containing the antibiotic.

Any other gene which allows for rapid screening of the cells containing a double, reciprocal recombination may be used in place of a gene for antibiotic resistance. For example, any gene for an essential protein, including enzymes, cofactors, proteins which are necessary for the synthesis of essential lipids, polysaccharides, nucleic acids, and other protein molecules such as receptors, as well as nucleic acids which have functional activity such as ribozymes, may be used. Other genes which confer a detectable phenotype on the cell strain such as sensitivity to temperature or ultraviolet radiation, auxotrophism for a sugar, amino acid, protein or nucleotide, or any other phenotype which can be detected by chemical indicators in either an in vitro or in vivo assay, or an immunoassay for a specific cellular component may also be used. Such chemical, radioactive, or immunological screening assays are well known to those skilled in the art.
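The outcome of the double, reciprocal recombination and the subsequent selection can be modeled as a simple string operation. The following Python sketch is a toy model under stated assumptions: sequences are short placeholder strings (far shorter than the 80 to 100 nucleotides of homology the method calls for), the marker is represented by the token "KANR", and the function names are hypothetical. It does not model RecA activity or recombination frequency.

```python
from typing import List, Optional


def double_crossover(chromosome: str, left_arm: str, marker: str,
                     right_arm: str) -> Optional[str]:
    """Replace everything between the two homology arms with the marker.

    Returns None when either arm is not found, i.e. no recombination occurs.
    """
    left = chromosome.find(left_arm)
    if left == -1:
        return None
    left_end = left + len(left_arm)
    right = chromosome.find(right_arm, left_end)
    if right == -1:
        return None
    # Crossovers within each arm swap the intervening target gene for the marker.
    return chromosome[:left_end] + marker + chromosome[right:]


def select_on_antibiotic(cells: List[str], marker: str) -> List[str]:
    """Keep only cells whose chromosome carries the marker (toy selection step)."""
    return [cell for cell in cells if marker in cell]


if __name__ == "__main__":
    target_gene = "GGGCCC"  # placeholder for the gene to be deleted
    wild_type = "TTTT" + "ACGTACGA" + target_gene + "TGCATGCA" + "AAAA"
    recombinant = double_crossover(wild_type, "ACGTACGA", "KANR", "TGCATGCA")
    survivors = select_on_antibiotic([wild_type, recombinant], "KANR")
    print(recombinant, len(survivors))
```

In this model the untransformed cell lacks the marker and is removed by the selection step, mirroring the elimination of cells without the insertion on antibiotic medium.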

3.5.5.3. Eliminating Contamination from the Host Producing its Own Protein by Deletion and Replacement

In one application, in which the goal is to produce and purify a foreign protein, and the microorganism encodes its own version of the protein, the gene for the microorganism's own protein is eliminated. The gene for the foreign protein is then inserted and the protein produced. The purification process is thereby simplified since there is no contamination by the host protein, whether analogous to the protein being produced or an unrelated protein which copurifies with, or interferes with the purification of, the protein being produced. For example, a protein may interfere with binding of the protein to be purified to a column.

3.5.5.4. A Plasmid is Used to Introduce a Mutated Gene into an Organism

In a second application, a plasmid is used to introduce a gene into an organism which typically contains a mutation in the gene to be investigated resulting in a negative phenotype for the product of the gene to be investigated. Failure of the organism to produce a biologically active form of the protein encoded by the mutated gene may confer a lethal phenotype under certain defined conditions. For example, this can be at a temperature, designated as the restrictive temperature, at which the mutant protein denatures or otherwise undergoes inactivation. The cloned gene which is introduced is selected for by virtue of its ability to confer cell viability or any other detectable phenotype for the desired protein at the restrictive temperature. The acquisition of viability or other detectable phenotype at the normally restrictive temperature is evidence that the gene of interest has been cloned.

3.5.5.4.1. Problems: The Defective Host Protein May Interact With a Protein Produced from the Introduced Plasmid

The major problem with this second system is that, unless the inactive gene is deleted in its entirety, the defective host protein may interact with a protein produced from the introduced plasmid. This interaction may stabilize the defective host protein enough so that its activity is restored even at the restrictive temperature. In this case, the restoration of growth at the restrictive temperature would be a false positive, that is, the growth would not be due to activity encoded by the cloned DNA segment. This problem holds true for all selections based on complementation of a phenotype which is due to a defect in a specific protein. Such phenotypes include temperature sensitivity, amino acid auxotrophies or any other auxotrophies which result from a lack of synthesis of a key ingredient such as a sugar, nucleotide, critical protein or nucleic acid, or cofactor used for oxidation, reduction, or transamination reactions.

3.5.5.5. False Positives

The problem of “false positives” also exists for cloned DNA pieces which are created to encode enzyme fragments as a means to define the catalytic core or to define any segment which achieves a specific purpose, such as a piece which undergoes self-association, binds to a specific ligand or receptor, or forms a specific complex or array with one or more additional components. In these cases, the engineering of protein fragments, which are tested in a host cell that encodes a defective version of the protein of interest, is seriously hampered if the defective host protein interacts in any way with the engineered pieces.

3.5.5.6. Use of Temperature Sensitive Replicons

In the present invention, specifically designed linear DNA fragments are used to create a deletion of a gene by site-specific recombination. These fragments are transformed into the host cell. Cell viability or the detectable phenotype can be maintained during the procedure by provision of the gene encoding the desired protein on a recombinant plasmid that has a temperature-sensitive replicon, so that the cells which contain the deletion have a temperature sensitive phenotype. To achieve the deletion by recombination with the linear DNA fragments, it is necessary for the cells to have a RecA+ phenotype which is derived from recA, or its equivalent. Once recombination has occurred, the cell must immediately be changed to RecA− or else the temperature sensitive plasmid will recombine with homologous sequences on the chromosome. The same would apply to any other extrachromosomal element where integration into the host chromosome would be undesirable.

The RecA− phenotype may be achieved by simultaneous inactivation of recA during the transformation with linear fragments or, after the transformation, by immediately introducing RecA− by mating with an appropriate RecA− strain or by transduction with a phage which carries a RecA− gene segment. Although mutagenesis may also be an effective means of making the cell RecA−, this is a “hit or miss” approach. The preferred method is to use homologous recombination of linear DNA sequences bounded by sequences hybridizing to the sequences flanking the recA gene. The recA gene is necessary in order for the gene encoding the desired protein to be incorporated into the organism. However, any plasmids or other extrachromosomal elements in the cell will be incorporated unless the recA gene is immediately removed. This is a particular concern where the bacterium serves as a host for the expression of a genetically engineered protein from multicopy plasmids.

3.6. Artificial Chromosomes

3.6.1 Artificial Chromosomes, Uses Thereof and Methods for Preparing Artificial Chromosomes

Methods for preparing cell lines that contain artificial chromosomes, methods for preparation of artificial chromosomes, methods for isolation of the artificial chromosomes, methods for purification of artificial chromosomes, methods for targeted insertion of heterologous DNA into artificial chromosomes, methods for delivery of the chromosomes to selected cells and tissues, and methods for isolation and large-scale production of the chromosomes are provided. Also provided are cell lines for use in the methods, and cell lines and chromosomes produced by the methods. In particular, satellite artificial chromosomes that, except for inserted heterologous DNA, are substantially composed of heterochromatin are provided. Cell-based methods for use of the artificial chromosomes, including for gene therapy, production of gene products and production of transgenic plants and animals are also provided.

3.6.2 Limitations of Existing Gene Delivery Technologies

Several viral vectors, non-viral, and physical delivery systems for gene therapy and recombinant expression of heterologous nucleic acids have been developed [see, e.g., Mitani et al. (1993) Trends Biotech. 11: 162-166]. The presently available systems, however, have numerous limitations, particularly where persistent, stable, or controlled gene expression is required. These limitations include: (1) size limitations because there is a limit, generally on the order of about ten kilobases [kb], at most, to the size of the DNA insert [gene] that can be accepted by viral vectors, whereas a number of mammalian genes of possible therapeutic importance are well above this limit, especially if all control elements are included; (2) the inability to specifically target integration so that random integration occurs which carries a risk of disrupting vital genes or cancer suppressor genes; (3) the expression of randomly integrated therapeutic genes may be affected by the functional compartmentalization in the nucleus and by chromatin-based position effects; and (4) the copy number and consequently the expression of a given gene to be integrated into the genome cannot be controlled. Thus, improvements in gene delivery and stable expression systems are needed [see, e.g., Mulligan (1993) Science 260: 926-932].

In addition, safe and effective vectors and gene therapy methods should have numerous features that are not assured by the presently available systems. For example, a safe vector should not contain DNA elements that can promote unwanted changes by recombination or mutation in the host genetic material, should not have the potential to initiate deleterious effects in cells, tissues, or organisms carrying the vector, and should not interfere with genomic functions. In addition, it would be advantageous for the vector to be non-integrative, or designed for site-specific integration. Also, the copy number of therapeutic gene(s) carried by the vector should be controlled and stable, the vector should secure the independent and controlled function of the introduced gene(s), and the vector should accept large (up to Mb size) inserts and ensure the functional stability of the insert.

The limitations of existing gene delivery technologies, however, argue for the development of alternative vector systems suitable for transferring large [up to Mb size or larger] genes and gene complexes together with regulatory elements that will provide a safe, controlled, and persistent expression of the therapeutic genetic material.

At the present time, none of the available vectors fulfills all these requirements. Most of these characteristics, however, are possessed by chromosomes. Thus, an artificial chromosome would be an ideal vector for gene therapy, as well as for stable, high-level, controlled production of gene products that require coordination of expression of numerous genes or that are encoded by large genes, and other uses. Artificial chromosomes for expression of heterologous genes in yeast are available, but construction of defined mammalian artificial chromosomes has not been achieved. Such construction has been hindered by the lack of an isolated, functional, mammalian centromere and uncertainty regarding the requisites for its production and stable replication. Unlike in yeast, there are no selectable genes in close proximity to a mammalian centromere, and the presence of long runs of highly repetitive pericentric heterochromatic DNA makes the isolation of a mammalian centromere using presently available methods, such as chromosome walking, virtually impossible. Other strategies are required for production of mammalian artificial chromosomes, and some have been developed. For example, U.S. Pat. No. 5,288,625 provides a cell line that contains an artificial chromosome, a minichromosome, that is about 20 to 30 megabases. Methods provided for isolation of these chromosomes, however, provide preparations of only about 10-20% purity. Thus, development of alternative artificial chromosomes and perfection of isolation and purification methods as well as development of more versatile chromosomes and further characterization of the minichromosomes is required to realize the potential of this technology.

Therefore, it is an object herein to provide mammalian artificial chromosomes and methods for introduction of foreign DNA into such chromosomes. It is also an object herein to provide methods of isolation and purification of the chromosomes. It is also an object herein to provide methods for introduction of the mammalian artificial chromosome into selected cells, and to provide the resulting cells, as well as transgenic animals, birds, fish and plants that contain the artificial chromosomes. It is also an object herein to provide methods for gene therapy and expression of gene products using artificial chromosomes. It is a further object herein to provide methods for constructing species-specific artificial chromosomes de novo. Another object herein is to provide methods to generate de novo mammalian artificial chromosomes.

3.6.3 Methods for Preparing Artificial Chromosomes

Mammalian artificial chromosomes [MACs] are provided. Also provided are artificial chromosomes for other higher eukaryotic species, such as insects, birds, fowl and fish, produced using the MACs and methods provided herein. Methods for generating and isolating such chromosomes are provided.

Methods using the MACs to construct artificial chromosomes from other species, such as insect, bird, fowl and fish species are also provided. The artificial chromosomes are fully functional stable chromosomes. Two types of artificial chromosomes are provided. One type, herein referred to as SATACs [satellite artificial chromosomes], are stable heterochromatic chromosomes, and the other type are minichromosomes based on amplification of euchromatin.

Artificial chromosomes provide an extra-genomic locus for targeted integration of megabase pair size DNA fragments that contain single or multiple genes, including multiple copies of a single gene operatively linked to one promoter or each copy or several copies linked to separate promoters. Thus, methods using the MACs to introduce the genes into cells, tissues, and animals, as well as species such as birds, fowl, fish and plants, are also provided. The artificial chromosomes with integrated heterologous DNA may be used in methods of gene therapy, in methods of production of gene products, particularly products that require expression of multigenic biosynthetic pathways, and also are intended for delivery into the nuclei of germline cells, such as embryo-derived stem cells [ES cells], for production of transgenic animals, birds, fowl and fish. Transgenic plants, including monocots and dicots, are also contemplated herein.

Mammalian artificial chromosomes provide extra-genomic specific integration sites for introduction of genes encoding proteins of interest and permit megabase size DNA integration so that, for example, genes encoding an entire metabolic pathway or a very large gene, such as the cystic fibrosis [CF; about 250 kb] genomic DNA gene [cystic fibrosis [CF; about 600 kb] gene], several genes, such as multiple genes encoding a series of antigens for preparation of a multivalent vaccine, can be stably introduced into a cell. Vectors for targeted introduction of such genes, including the tumor suppressor genes, such as p53, the cystic fibrosis transmembrane regulator cDNA [CFTR], and the genes for anti-HIV ribozymes, such as an anti-HIV gag ribozyme gene, into the artificial chromosomes are also provided.

The chromosomes provided herein are generated by introducing heterologous DNA that includes DNA encoding one or multiple selectable marker(s) into cells, preferably a stable cell line, growing the cells under selective conditions, and identifying from among the resulting clones those that include chromosomes with more than one centromere and/or fragments thereof. The amplification that produces the additional centromere occurs in cells that contain chromosomes in which the heterologous DNA has integrated near the centromere in the pericentric region of the chromosome. The selected clonal cells are then used to generate artificial chromosomes.

In preferred embodiments, the DNA with the selectable marker that is introduced into cells to generate artificial chromosomes includes sequences that target it to the pericentric region of the chromosome. For example, vectors, such as pTEMPUD, which includes such DNA specific for mouse satellite DNA, are provided. Also provided are derivatives of pTEMPUD containing human satellite DNA sequences that specifically target human chromosomes or human satellite sequences. Upon integration, these vectors can induce the amplification that results in generation of additional centromeres.

Artificial chromosomes are generated by culturing the cells with the multi-centric, typically dicentric, chromosomes under conditions whereby the chromosome breaks to form a minichromosome and a formerly dicentric chromosome. Among the MACs provided herein are the SATACs, which are primarily made up of repeating units of short satellite DNA and are fully heterochromatic, so that without insertion of heterologous or foreign DNA, the chromosomes preferably contain no genetic information. They can thus be used as “safe” vectors for delivery of DNA to mammalian hosts because they do not contain any potentially harmful genes. The SATACs are generated, not from the minichromosome fragment as, for example, in U.S. Pat. No. 5,288,625, but from the fragment of the formerly dicentric chromosome. In addition, methods for generating euchromatic minichromosomes and the use thereof are also provided herein. Methods for generating one type of MAC, the minichromosome, previously described in U.S. Pat. No. 5,288,625, and the use thereof for expression of heterologous DNA are provided. Cell lines containing the minichromosome and the use thereof for cell fusion are also provided.

In one embodiment, a cell line containing the mammalian minichromosome is used as recipient cells for donor DNA encoding a selected gene or multiple genes. To facilitate integration of the donor DNA into the minichromosome, the recipient cell line preferably contains the minichromosome but does not also contain the formerly dicentric chromosome. This may be accomplished by methods disclosed herein such as cell fusion and selection of cells that contain a minichromosome and no formerly dicentric chromosome. The donor DNA is linked to a second selectable marker and is targeted to and integrated into the minichromosome. The resulting chromosome is transferred by cell fusion into an appropriate recipient cell line, such as a Chinese hamster cell line [CHO]. After large-scale production of the cells carrying the engineered chromosome, the chromosome is isolated. In particular, metaphase chromosomes are obtained, such as by addition of colchicine, and they are purified from the cell lysate. These chromosomes are used for cloning, sequencing and for delivery of heterologous DNA into cells.

Also provided are SATACs of various sizes that are formed by repeated culturing under selective conditions and subcloning of cells that contain chromosomes produced from the formerly dicentric chromosomes. The exemplified SATACs are based on repeating DNA units that are about 15 Mb [two about 7.5 Mb blocks]. The repeating DNA unit of SATACs formed from other species and other chromosomes may vary, but typically would be on the order of about 7 to about 20 Mb. The repeating DNA units are referred to herein as megareplicons, which in the exemplified SATACs contain tandem blocks of satellite DNA flanked by non-satellite DNA, including heterologous DNA and non-satellite DNA. Amplification produces an array of chromosome segments [each called an amplicon] that contain two inverted megareplicons bordered by heterologous [“foreign”] DNA. Repeated cell fusion, growth on selective medium and/or BrdU [5-bromodeoxyuridine] treatment or other treatment with other genome destabilizing reagent or agent, such as ionizing radiation, including X-rays, and subcloning results in cell lines that carry stable heterochromatic or partially heterochromatic chromosomes, including a 150-200 Mb “sausage” chromosome, a 500-1000 Mb gigachromosome, a stable 250-400 Mb megachromosome and various smaller stable chromosomes derived therefrom. These chromosomes are based on these repeating units and can include heterologous DNA that is expressed.

Thus, methods for producing MACs of both types (i.e., SATACs and minichromosomes) are provided. These methods are applicable to the production of artificial chromosomes containing centromeres derived from any higher eukaryotic cell, including mammals, birds, fowl, fish, insects and plants.

The resulting chromosomes can be purified by methods provided herein to provide vectors for introduction of heterologous DNA into selected cells for production of the gene product(s) encoded by the heterologous DNA, for production of transgenic animals, birds, fowl, fish and plants or for gene therapy.

In addition, methods and vectors for fragmenting the minichromosomes and SATACs are provided. Such methods and vectors can be used for in vivo generation of smaller stable artificial chromosomes. Vectors for chromosome fragmentation are used to produce an artificial chromosome that contains a megareplicon, a centromere and two telomeres and will be between about 7.5 Mb and about 60 Mb, preferably between about 10-15 Mb and 30-50 Mb. As exemplified herein, the preferred range is between about 7.5 Mb and 50 Mb. Such artificial chromosomes may also be produced by other methods.

Isolation of the 15 Mb [or 30 Mb amplicon containing two 15 Mb inverted repeats] or a 30 Mb or higher multimer, such as 60 Mb, thereof should provide a stable chromosomal vector that can be manipulated in vitro. Methods for reducing the size of the MACs to generate smaller stable self-replicating artificial chromosomes are also provided.
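The size relationships stated for the exemplified SATACs (a 7.5 Mb block, a 15 Mb megareplicon made of two blocks, a 30 Mb amplicon made of two inverted megareplicons, and higher multimers such as 60 Mb) can be checked with a short worked calculation. This is a minimal Python sketch; the constant and function names are assumptions for illustration, and the sizes are the approximate values quoted in the text rather than measured values.

```python
# Approximate sizes in megabases (Mb), taken from the exemplified SATACs in the text.
BLOCK_MB = 7.5                         # one block of the repeating unit
MEGAREPLICON_MB = 2 * BLOCK_MB         # "about 15 Mb [two about 7.5 Mb blocks]"
AMPLICON_MB = 2 * MEGAREPLICON_MB      # two inverted megareplicons per amplicon


def multimer_size_mb(n_amplicons: int) -> float:
    """Approximate size of a multimer built from n_amplicons 30 Mb amplicons."""
    if n_amplicons < 1:
        raise ValueError("a multimer needs at least one amplicon")
    return n_amplicons * AMPLICON_MB


if __name__ == "__main__":
    for n in (1, 2):
        print(n, multimer_size_mb(n))
```

With these values, one amplicon corresponds to the 30 Mb unit and two amplicons to the 60 Mb multimer mentioned above.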

Methods and vectors for targeting heterologous DNA into the artificialchromosomes are also provided as are methods and vectors for fragmentingthe chromosomes to produce smaller but stable and self-replicatingartificial chromosomes.

The chromosomes are introduced into cells to produce stable transformedcell lines or cells, depending upon the source of the cells.Introduction is effected by any suitable method including, but notlimited to electroporation, direct uptake, such as by calcium phosphateprecipitation, uptake of isolated chromosomes by lipofection, bymicrocell fusion, by lipid-mediated carrier systems or other suitablemethod. The resulting cells can be used for production of proteins inthe cells. The chromosomes can be isolated and used for gene delivery.

Methods for isolation of the chromosomes based on the DNA content of the chromosomes, which differs in MACs versus the authentic chromosomes, are provided.

These artificial chromosomes can be used in gene therapy, gene product production systems, production of humanized genetically transformed animal organs, production of transgenic plants and animals, including mammals, birds, fowl, fish, invertebrates, vertebrates, reptiles and insects, any organism or device that would employ chromosomal elements as information storage vehicles, and also for analysis and study of centromere function, for the production of artificial chromosome vectors that can be constructed in vitro, and for the preparation of species-specific artificial chromosomes. The artificial chromosomes can be introduced into cells using microinjection, cell fusion, microcell fusion, electroporation, electrofusion, projectile bombardment, calcium phosphate precipitation, site-specific targeting, lipid-mediated transfer systems and other such methods. Cells particularly suited for use with the artificial chromosomes include, but are not limited to, plant cells, particularly tomato, arabidopsis, and others, insect cells, including silk worm cells, insect larvae, fish, reptiles, amphibians, arachnids, mammalian cells, avian cells, embryonic stem cells, haematopoietic stem cells, embryos and cells for use in methods of genetic therapy, such as lymphocytes that are used in methods of adoptive immunotherapy and nerve or neural cells. Thus methods of producing gene products and transgenic animals and plants are provided. Also provided are the resulting transgenic animals and plants.

Exemplary cell lines that contain these chromosomes are also provided.

Methods for preparing artificial chromosomes for particular species and for cloning centromeres are also provided. For example, two methods for generating artificial chromosomes for use in different species are provided. First, the methods herein may be applied to different species. Second, means for generating species-specific artificial chromosomes and for cloning centromeres are provided. In particular, a method is provided for cloning a centromere from an animal or plant by preparing a library of DNA fragments that contain the genome of the plant or animal, and introducing each of the fragments into a mammalian satellite artificial chromosome [SATAC] that contains a selectable marker and a centromere from a species, generally a mammal, different from the selected plant or animal, generally a non-mammal. The selected plant or animal is one in which the mammalian species centromere does not function. Each of the SATACs is introduced into the cells, which are grown under selective conditions, and cells with SATACs are identified. Such SATACs should contain a centromere encoded by the DNA from the library or should contain the necessary elements for stable replication in the selected species.

Also provided are libraries in which the relatively large fragments of DNA are contained on artificial chromosomes.

Transgenic animals, invertebrates and vertebrates, plants and insects, fish, reptiles, amphibians, arachnids, birds, fowl, and mammals are also provided. Of particular interest are transgenic animals that express genes that confer resistance or reduce susceptibility to disease. Since multiple genes can be introduced on a MAC, a series of genes encoding an antigen can be introduced, which upon expression will serve to immunize [in a manner similar to a multivalent vaccine] the host animal against the diseases for which exposure to the antigens provides immunity or some protection.

Also of interest are transgenic animals that serve as models of certain diseases and disorders for use in studying the disease and developing therapeutic treatments and cures thereof. Such animal models of disease express genes [typically carrying a disease-associated mutation], which are introduced into the animal on a MAC and which induce the disease or disorder in the animal. Similarly, MACs carrying genes encoding antisense RNA may be introduced into animal cells to generate conditional “knock-out” transgenic animals. In such animals, expression of the antisense RNA results in decreased or complete elimination of the products of genes corresponding to the antisense RNA. Of further interest are transgenic mammals that harbor MAC-carried genes encoding therapeutic proteins that are expressed in the animal's milk. Transgenic animals for use in xenotransplantation, which express MAC-carried genes that serve to humanize the animal's organs, are also of interest. Genes that might be used in humanizing animal organs include those encoding human surface antigens.

Methods for cloning centromeres, such as mammalian centromeres, are also provided. In particular, in one embodiment, a library composed of fragments of SATACs is cloned into YACs [yeast artificial chromosomes] that include a detectable marker, such as DNA encoding tyrosinase, and then introduced into mammalian cells, such as albino mouse embryos. Mice produced from embryos containing such YACs that include a centromere that functions in mammals will express the detectable marker. Thus, if mice are produced from albino mouse embryos into which a functional mammalian centromere was introduced, the mice will be pigmented or have regions of pigmentation.

3.6.4 Particularly Relevant Definitions

Unless defined otherwise, all technical and scientific terms used herein have the same meaning as is commonly understood by one of skill in the art to which this invention belongs. All patents and publications referred to herein are incorporated by reference.

As used herein, a mammalian artificial chromosome [MAC] is a piece of DNA that can stably replicate and segregate alongside endogenous chromosomes. It has the capacity to accommodate and express heterologous genes inserted therein. It is referred to as a mammalian artificial chromosome because it includes an active mammalian centromere(s). Plant artificial chromosomes, insect artificial chromosomes and avian artificial chromosomes refer to chromosomes that include plant, insect and avian centromeres, respectively. A human artificial chromosome [HAC] refers to chromosomes that include human centromeres, BUGACs refer to insect artificial chromosomes, and AVACs refer to avian artificial chromosomes.

As used herein, stable maintenance of chromosomes occurs when at least about 85%, preferably 90%, more preferably 95%, of the cells retain the chromosome. Stability is measured in the presence of a selective agent. Preferably these chromosomes are also maintained in the absence of a selective agent. Stable chromosomes also retain their structure during cell culturing, suffering neither intrachromosomal nor interchromosomal rearrangements.
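The retention figures in this definition can be illustrated with a short sketch. The function name, the tier labels and the decision to treat "about" as a hard cutoff are assumptions made for illustration only; the definition itself does not prescribe how retention is scored.

```python
def retention_tier(cells_retaining: int, cells_scored: int) -> str:
    """Classify a retention count against the 85/90/95% figures quoted above.

    Assumption: the word "about" in the definition is treated as a hard cutoff.
    """
    if cells_scored <= 0:
        raise ValueError("cells_scored must be positive")
    if not 0 <= cells_retaining <= cells_scored:
        raise ValueError("cells_retaining must lie between 0 and cells_scored")
    # Integer arithmetic avoids floating-point edge cases at the cutoffs.
    scaled = 100 * cells_retaining
    if scaled >= 95 * cells_scored:
        return "more preferred"
    if scaled >= 90 * cells_scored:
        return "preferred"
    if scaled >= 85 * cells_scored:
        return "stable"
    return "not stable"
```

For instance, a culture in which 172 of 200 scored cells retain the chromosome (86%) would fall in the "stable" tier under these assumptions. The sketch covers only the retention percentage, not the separate requirement that the chromosome be free of rearrangements.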

As used herein, growth under selective conditions means growth of a cell under conditions that require expression of a selectable marker for survival.

As used herein, euchromatin and heterochromatin have their recognized meanings: euchromatin refers to DNA that contains genes, and heterochromatin refers to chromatin that has been thought to be inactive. Highly repetitive DNA sequences [satellite DNA], at least with respect to mammalian cells, are usually located in regions of centromeric heterochromatin [pericentric heterochromatin]. Constitutive heterochromatin refers to heterochromatin that contains the highly repetitive DNA which is constitutively condensed and genetically inactive.

As used herein, BrdU refers to 5-bromodeoxyuridine, which during replication is inserted in place of thymidine. BrdU is used as a mutagen; it also inhibits condensation of metaphase chromosomes during cell division.

As used herein, a dicentric chromosome is a chromosome that contains two centromeres. A multicentric chromosome contains more than two centromeres.

As used herein, a formerly dicentric chromosome is a chromosome that is produced when a dicentric chromosome fragments and acquires new telomeres so that two chromosomes, each having one of the centromeres, are produced. Each of the fragments is a replicable chromosome. If one of the chromosomes undergoes amplification of euchromatic DNA to produce a fully functional chromosome that contains the newly introduced heterologous DNA and primarily [at least more than 50%] euchromatin, it is a minichromosome. The remaining chromosome is a formerly dicentric chromosome. If one of the chromosomes undergoes amplification, whereby heterochromatin [satellite DNA] is amplified and a euchromatic portion [or arm] remains, it is referred to as a sausage chromosome. A chromosome that is substantially all heterochromatin, except for portions of heterologous DNA, is called a SATAC. Such chromosomes [SATACs] can be produced from sausage chromosomes by culturing the cell containing the sausage chromosome under conditions, such as BrdU treatment and/or growth under selective conditions, that destabilize the chromosome so that a satellite artificial chromosome [SATAC] is produced. For purposes herein, it is understood that SATACs may not necessarily be produced in multiple steps, but may appear after the initial introduction of the heterologous DNA and growth under selective conditions, or they may appear after several cycles of growth under selective conditions and BrdU treatment.

As used herein, an amplicon is a repeated DNA amplification unit that contains a set of inverted repeats of the megareplicon. A megareplicon represents a higher order replication unit. For example, with reference to the SATACs, the megareplicon contains a set of tandem DNA blocks each containing satellite DNA flanked by non-satellite DNA. Contained within the megareplicon is a primary replication site, referred to as the megareplicator, which may be involved in organizing and facilitating replication of the pericentric heterochromatin and possibly the centromeres. Within the megareplicon there may be smaller [e.g., 50-300 kb in some mammalian cells] secondary replicons. In the exemplified SATACs, the megareplicon is defined by two tandem DNA blocks of about 7.5 Mb each [see, e.g., FIG. 3]. Within each artificial chromosome [AC] or among a population thereof, each amplicon has the same gross structure but may contain sequence variations. Such variations will arise as a result of movement of mobile genetic elements, deletions or insertions or mutations that arise, particularly in culture. Such variation does not affect the use of the ACs or their overall structure as described herein.

As used herein, the minichromosome refers to a chromosome derived from a multicentric, typically dicentric, chromosome [see, e.g., FIG. 1] that contains more euchromatic than heterochromatic DNA.

As used herein, a megachromosome refers to a chromosome that, except for introduced heterologous DNA, is substantially composed of heterochromatin. Megachromosomes are made of an array of repeated amplicons that contain two inverted megareplicons bordered by introduced heterologous DNA [see, e.g., FIG. 3 for a schematic drawing of a megachromosome]. For purposes herein, a megachromosome is about 50 to 400 Mb, generally about 250-400 Mb. Shorter variants are also referred to as truncated megachromosomes [about 90 to 120 or 150 Mb], dwarf megachromosomes [about 150-200 Mb] and cell lines, and a micro-megachromosome [about 60-90 Mb]. For purposes herein, the term megachromosome refers to the overall repeated structure based on an array of repeated chromosomal segments [amplicons] that contain two inverted megareplicons bordered by any inserted heterologous DNA. The size will be specified.

As used herein, genetic therapy involves the transfer or insertion of heterologous DNA into certain cells, target cells, to produce specific gene products that are involved in correcting or modulating disease. The DNA is introduced into the selected target cells in a manner such that the heterologous DNA is expressed and a product encoded thereby is produced. Alternatively, the heterologous DNA may in some manner mediate expression of DNA that encodes the therapeutic product. It may encode a product, such as a peptide or RNA, that in some manner mediates, directly or indirectly, expression of a therapeutic product. Genetic therapy may also be used to introduce therapeutic compounds, such as TNF, that are not normally produced in the host or that are not produced in therapeutically effective amounts or at a therapeutically useful time. Expression of the heterologous DNA by the target cells within an organism afflicted with the disease thereby enables modulation of the disease. The heterologous DNA encoding the therapeutic product may be modified prior to introduction into the cells of the afflicted host in order to enhance or otherwise alter the product or expression thereof.

As used herein, heterologous or foreign DNA and RNA are used interchangeably and refer to DNA or RNA that does not occur naturally as part of the genome in which it is present or which is found in a location or locations in the genome that differ from that in which it occurs in nature. It is DNA or RNA that is not endogenous to the cell and has been exogenously introduced into the cell. Examples of heterologous DNA include, but are not limited to, DNA that encodes a gene product or gene product(s) of interest, introduced for purposes of gene therapy or for production of an encoded protein. Other examples of heterologous DNA include, but are not limited to, DNA that encodes traceable marker proteins, such as a protein that confers drug resistance, DNA that encodes therapeutically effective substances, such as anti-cancer agents, enzymes and hormones, and DNA that encodes other types of proteins, such as antibodies. Antibodies that are encoded by heterologous DNA may be secreted or expressed on the surface of the cell in which the heterologous DNA has been introduced.

As used herein, a therapeutically effective product is a product that is encoded by heterologous DNA and that, upon introduction of the DNA into a host, is expressed and effectively ameliorates or eliminates the symptoms or manifestations of an inherited or acquired disease, or cures said disease.

As used herein, transgenic plants refer to plants in which heterologous or foreign DNA is expressed or in which the expression of a gene naturally present in the plant has been altered.

As used herein, operative linkage of heterologous DNA to regulatory and effector sequences of nucleotides, such as promoters, enhancers, transcriptional and translational stop sites, and other signal sequences refers to the relationship between such DNA and such sequences of nucleotides. For example, operative linkage of heterologous DNA to a promoter refers to the physical relationship between the DNA and the promoter such that the transcription of such DNA is initiated from the promoter by an RNA polymerase that specifically recognizes, binds to and transcribes the DNA in reading frame.

As used herein, isolated, substantially pure DNA refers to DNA fragments purified according to standard techniques employed by those skilled in the art, such as that found in Maniatis et al. [(1982) Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y.].

As used herein, expression refers to the process by which nucleic acid is transcribed into mRNA and translated into peptides, polypeptides, or proteins. If the nucleic acid is derived from genomic DNA, expression may, if an appropriate eukaryotic host cell or organism is selected, include splicing of the mRNA.

As used herein, vector or plasmid refers to discrete elements that are used to introduce heterologous DNA into cells for either expression of the heterologous DNA or for replication of the cloned heterologous DNA. Selection and use of such vectors and plasmids are well within the level of skill of the art.

As used herein, transformation/transfection refers to the process by which DNA or RNA is introduced into cells. Transfection refers to the taking up of exogenous nucleic acid, e.g., an expression vector, by a host cell whether or not any coding sequences are in fact expressed. Numerous methods of transfection are known to the ordinarily skilled artisan, for example, direct uptake using calcium phosphate [CaPO4; see, e.g., Wigler et al. (1979) Proc. Natl. Acad. Sci. U.S.A. 76: 1373-1376], polyethylene glycol [PEG]-mediated DNA uptake, electroporation, lipofection [see, e.g., Strauss (1996) Meth. Mol. Biol. 54: 307-327], microcell fusion [see, EXAMPLES; see, also Lambert (1991) Proc. Natl. Acad. Sci. U.S.A. 88: 5907-5911; U.S. Pat. No. 5,396,767, Sawford et al. (1987) Somatic Cell Mol. Genet. 13: 279-284; Dhar et al. (1984) Somatic Cell Mol. Genet. 10: 547-559; and McNeill-Killary et al. (1995) Meth. Enzymol. 254: 133-152], lipid-mediated carrier systems [see, e.g., Teifel et al. (1995) Biotechniques 19: 79-80; Albrecht et al. (1996) Ann. Hematol. 72: 73-79; Holmen et al. (1995) In Vitro Cell Dev. Biol. Anim. 31: 347-351; Remy et al. (1994) Bioconjug. Chem. 5: 647-654; Le Bolch et al. (1995) Tetrahedron Lett. 36: 6681-6684; Loeffler et al. (1993) Meth. Enzymol. 217: 599-618] or other suitable method. Successful transfection is generally recognized by detection of the presence of the heterologous nucleic acid within the transfected cell, such as any indication of the operation of a vector within the host cell. Transformation means introducing DNA into an organism so that the DNA is replicable, either as an extrachromosomal element or by chromosomal integration.

As used herein, injected refers to the microinjection [use of a small syringe] of DNA into a cell.

As used herein, substantially homologous DNA refers to DNA that includes a sequence of nucleotides that is sufficiently similar to another such sequence to form stable hybrids under specified conditions.

It is well known to those of skill in this art that nucleic acid fragments with different sequences may, under the same conditions, hybridize detectably to the same “target” nucleic acid. Two nucleic acid fragments hybridize detectably, under stringent conditions over a sufficiently long hybridization period, because one fragment contains a segment of at least about 14 nucleotides in a sequence which is complementary [or nearly complementary] to the sequence of at least one segment in the other nucleic acid fragment. If the time during which hybridization is allowed to occur is held constant, at a value during which, under preselected stringency conditions, two nucleic acid fragments with exactly complementary base-pairing segments hybridize detectably to each other, departures from exact complementarity can be introduced into the base-pairing segments, and base-pairing will nonetheless occur to an extent sufficient to make hybridization detectable. As the departure from complementarity between the base-pairing segments of two nucleic acids becomes larger, and as conditions of the hybridization become more stringent, the probability decreases that the two segments will hybridize detectably to each other.

Two single-stranded nucleic acid segments have “substantially the same sequence,” within the meaning of the present specification, if (a) both form a base-paired duplex with the same segment, and (b) the melting temperatures of said two duplexes in a solution of 0.5× SSPE differ by less than 10° C. If the segments being compared have the same number of bases, then to have “substantially the same sequence”, they will typically differ in their sequences at fewer than 1 base in 10. Methods for determining melting temperatures of nucleic acid duplexes are well known [see, e.g., Meinkoth and Wahl (1984) Anal. Biochem. 138: 267-284 and references cited therein].
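The two numerical tests in this definition can be sketched as follows. The function names are hypothetical, the melting temperatures are assumed to be supplied from measurement (nothing here predicts them), and the "fewer than 1 base in 10" figure is applied only as the typical rule of thumb the text describes for equal-length segments.

```python
def melting_temperatures_agree(tm_a_c: float, tm_b_c: float) -> bool:
    """Criterion (b): duplex melting temperatures differ by less than 10 deg C.

    Assumption: both values were measured in 0.5x SSPE against the same segment.
    """
    return abs(tm_a_c - tm_b_c) < 10.0


def differs_at_fewer_than_one_in_ten(seq_a: str, seq_b: str) -> bool:
    """Rule of thumb for equal-length segments: fewer than 1 mismatch per 10 bases."""
    if len(seq_a) != len(seq_b):
        raise ValueError("segments must have the same number of bases")
    if not seq_a:
        raise ValueError("segments must not be empty")
    mismatches = sum(x != y for x, y in zip(seq_a.upper(), seq_b.upper()))
    # Integer comparison: mismatches / length < 1 / 10.
    return 10 * mismatches < len(seq_a)
```

Under these assumptions, two 20-base segments differing at one position pass the sequence check, whereas two 10-base segments differing at one position do not, because 1 in 10 is not fewer than 1 in 10.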

As used herein, a nucleic acid probe is a DNA or RNA fragment that includes a sufficient number of nucleotides to specifically hybridize to DNA or RNA that includes identical or closely related sequences of nucleotides. A probe may contain any number of nucleotides, from as few as about 10 to as many as hundreds of thousands of nucleotides. The conditions and protocols for such hybridization reactions are well known to those of skill in the art, as are the effects of probe size, temperature, degree of mismatch, salt concentration and other parameters on the hybridization reaction. For example, the lower the temperature and higher the salt concentration at which the hybridization reaction is carried out, the greater the degree of mismatch that may be present in the hybrid molecules.

To be used as a hybridization probe, the nucleic acid is generally rendered detectable by labelling it with a detectable moiety or label, such as ³²P, ³H and ¹⁴C, or by other means, including chemical labelling, such as by nick-translation in the presence of deoxyuridylate biotinylated at the 5-position of the uracil moiety. The resulting probe includes the biotinylated uridylate in place of thymidylate residues and can be detected [via the biotin moieties] by any of a number of commercially available detection systems based on binding of streptavidin to the biotin. Such commercially available detection systems can be obtained, for example, from Enzo Biochemicals, Inc. [New York, N.Y.]. Any other label known to those of skill in the art, including non-radioactive labels, may be used as long as it renders the probes sufficiently detectable, which is a function of the sensitivity of the assay, the time available [for culturing cells, extracting DNA, and hybridization assays], the quantity of DNA or RNA available as a source of the probe, the particular label and the means used to detect the label.

Once sequences with a sufficiently high degree of homology to the probe are identified, they can readily be isolated by standard techniques, which are described, for example, by Maniatis et al. [(1982) Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y.].

As used herein, conditions under which DNA molecules form stable hybrids and are considered substantially homologous are such that DNA molecules with at least about 60% complementarity form stable hybrids. Such DNA fragments are herein considered to be “substantially homologous”. For example, DNA that encodes a particular protein is substantially homologous to another DNA fragment if the DNA forms stable hybrids such that the sequences of the fragments are at least about 60% complementary and if a protein encoded by the DNA retains its activity.
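As a rough illustration of the 60% figure, percent complementarity between two equal-length strands can be counted position by position. This is a simplified sketch under stated assumptions: both strands are written 5′ to 3′, pairing is antiparallel and ungapped, and only Watson-Crick DNA pairs count. The function name is hypothetical, and the calculation says nothing about whether a hybrid is actually stable or whether an encoded protein retains activity.

```python
_COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def percent_complementarity(strand: str, other: str) -> float:
    """Percentage of positions that pair when `other` is aligned antiparallel.

    Assumptions: equal lengths, no gaps, both strands written 5' to 3'.
    """
    if len(strand) != len(other):
        raise ValueError("strands must have equal length for this ungapped sketch")
    if not strand:
        raise ValueError("strands must not be empty")
    paired = sum(
        _COMPLEMENT.get(x) == y
        for x, y in zip(strand.upper(), reversed(other.upper()))
    )
    return 100.0 * paired / len(strand)


def meets_sixty_percent(strand: str, other: str) -> bool:
    """Compare against the 'at least about 60%' figure, treated as a hard cutoff."""
    return percent_complementarity(strand, other) >= 60.0
```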

For purposes herein, the following stringency conditions are defined:

-   1) high stringency: 0.1× SSPE, 0.1% SDS, 65° C.
-   2) medium stringency: 0.2× SSPE, 0.1% SDS, 50° C.
-   3) low stringency: 1.0× SSPE, 0.1% SDS, 50° C.
-   or any combination of salt and temperature and other reagents that results in selection of the same degree of mismatch or matching.
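The three defined conditions can be held in a small lookup table, sketched here. The type and field names are hypothetical; only the numerical values come from the list above.

```python
from typing import NamedTuple


class StringencyCondition(NamedTuple):
    sspe_x: float         # SSPE concentration, in multiples of 1x
    sds_percent: float    # SDS concentration, in percent
    temperature_c: float  # temperature, in degrees Celsius


# Values as defined in the text; the dictionary keys are illustrative labels.
STRINGENCY = {
    "high": StringencyCondition(0.1, 0.1, 65.0),
    "medium": StringencyCondition(0.2, 0.1, 50.0),
    "low": StringencyCondition(1.0, 0.1, 50.0),
}


def describe(level: str) -> str:
    """Return a one-line description of a named stringency level."""
    c = STRINGENCY[level]
    return f"{c.sspe_x}x SSPE, {c.sds_percent}% SDS, {c.temperature_c:g} deg C"
```

The table makes the ordering visible: moving from low to high stringency lowers the salt concentration and, for the highest level, raises the temperature, while SDS stays at 0.1% throughout.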

As used herein, immunoprotective refers to the ability of a vaccine or exposure to an antigen or immunity-inducing agent, to confer upon a host to whom the vaccine or antigen is administered or introduced, the ability to resist infection by a disease-causing pathogen or to have reduced symptoms. The selected antigen is typically an antigen that is presented by the pathogen.

As used herein, all assays and procedures, such as hybridization reactions and antibody-antigen reactions, unless otherwise specified, are conducted under conditions recognized by those of skill in the art as standard conditions.

3.6.5 Preparation of Cell Lines Containing MACs

The methods, cells and MACs provided herein are produced by virtue of the discovery of the existence of a higher-order replication unit [megareplicon] of the centromeric region. This megareplicon is delimited by a primary replication initiation site [megareplicator], and appears to facilitate replication of the centromeric heterochromatin, and most likely, centromeres. Integration of heterologous DNA into the megareplicator region, or in close proximity thereto, initiates a large-scale amplification of megabase-size chromosomal segments, which leads to de novo chromosome formation in living cells.

Cell lines containing MACs can be prepared by transforming cells, preferably a stable cell line, with a heterologous DNA fragment that encodes a selectable marker, culturing under selective conditions, and identifying cells that have a multicentric, typically dicentric, chromosome. These cells can then be manipulated as described herein to produce the minichromosomes and other MACs, particularly the heterochromatic SATACs as described herein.

Development of a multicentric, particularly dicentric, chromosome typically is effected through integration of the heterologous DNA in the pericentric heterochromatin. Thus, the probability of incorporation can be increased by including DNA, such as satellite DNA, in the heterologous fragment that encodes the selectable marker. The resulting cell lines can then be treated as the exemplified cells herein to produce cells in which the dicentric chromosome has fragmented and to introduce additional selective markers into the dicentric chromosome, whereby amplification of the pericentric heterochromatin will produce the heterochromatic chromosomes. The following discussion is with reference to the EC3/7 line and use of resulting cells. The same procedures can be applied to any other cells, particularly cell lines, to create SATACs and euchromatic minichromosomes.

3.6.5.1 Formation of De Novo Chromosomes

De novo centromere formation in a transformed mouse LMTK⁻ fibroblast cell line [EC3/7] after cointegration of lambda constructs [lambda CM8 and lambda gtWESneo] carrying human and bacterial DNA [Hadlaczky et al. (1991) Proc. Natl. Acad. Sci. U.S.A. 88: 8106-8110 and U.S. application Ser. No. 08/375,271] has been shown. The integration of the “heterologous” engineered human, bacterial and phage DNA, and the subsequent amplification of mouse and heterologous DNA that led to the formation of a dicentric chromosome, occurred at the centromeric region of the short arm of a mouse chromosome. By G-banding, this chromosome was identified as mouse chromosome 7. Because of the presence of two functionally active centromeres on the same chromosome, regular breakages occur between the centromeres. Such specific chromosome breakages gave rise to the appearance [in approximately 10% of the cells] of a chromosome fragment carrying the neo-centromere. From the EC3/7 cell line [see, U.S. Pat. No. 5,288,625, deposited at the European Collection of Animal Cell Culture (hereinafter ECACC) under accession no. 90051001; see, also Hadlaczky et al. (1991) Proc. Natl. Acad. Sci. U.S.A. 88: 8106-8110, and U.S. application Ser. No. 08/375,271 and the corresponding published European application EP 0 473 253], two sublines [EC3/7C5 and EC3/7C6] were selected by repeated single-cell cloning. In these cell lines, the neo-centromere was found exclusively on a minichromosome [neo-minichromosome], while the formerly dicentric chromosome carried traces of “heterologous” DNA.

It has now been discovered that integration of DNA encoding a selectable marker in the heterochromatic region of the centromere led to formation of the dicentric chromosome.

3.6.5.2 The Neo-Minichromosome

The chromosome breakage in the EC3/7 cells, which separates the neo-centromere from the mouse chromosome, occurred in the G-band positive “heterologous” DNA region. This is supported by the observation of traces of lambda and human DNA sequences at the broken end of the formerly dicentric chromosome. Comparing the G-band pattern of the chromosome fragment carrying the neo-centromere with that of the stable neo-minichromosome, it is apparent that the neo-minichromosome is an inverted duplicate of the chromosome fragment that bears the neo-centromere. This is supported by the observation that although the neo-minichromosome carries only one functional centromere, both ends of the minichromosome are heterochromatic, and mouse satellite DNA sequences were found in these heterochromatic regions by in situ hybridization.

Mouse cells containing the minichromosome, which contains multiple repeats of the heterologous DNA, which in the exemplified embodiment is lambda DNA and the neomycin-resistance gene, can be used as recipient cells in cell transformation. Donor DNA, such as selected heterologous DNA containing lambda DNA linked to a second selectable marker, such as the gene encoding hygromycin phosphotransferase which confers hygromycin resistance [hyg], can be introduced into the mouse cells and integrated into the minichromosomes by homologous recombination of lambda DNA in the donor DNA with that in the minichromosomes. Integration is verified by in situ hybridization and Southern blot analyses. Transcription and translation of the heterologous DNA is confirmed by primer extension and immunoblot analyses.

For example, DNA has been targeted into the lambda neo-minichromosome in EC3/7C5 cells using a lambda DNA-containing construct [pNem1ruc] that also contains DNA encoding hygromycin resistance and the Renilla luciferase gene linked to a promoter, such as the cytomegalovirus [CMV] early promoter, and the bacterial neomycin resistance-encoding DNA. Integration of the donor DNA into the chromosome in selected cells [designated PHN4] was confirmed by nucleic acid amplification [PCR] and in situ hybridization.

The resulting engineered minichromosome that contains the heterologous DNA can then be transferred by cell fusion into a recipient cell line, such as Chinese hamster ovary cells [CHO], and correct expression of the heterologous DNA can be verified. Following production of the cells, metaphase chromosomes are obtained, such as by addition of colchicine, and the chromosomes purified by addition of AT and GC specific dyes on a dual laser beam based cell sorter. Preparative amounts of chromosomes [5×10⁴-5×10⁷ chromosomes/ml] at a purity of 95% or higher can be obtained. The resulting chromosomes are used for delivery to cells by methods such as microinjection and liposome-mediated transfer.

Thus, the neo-minichromosome is stably maintained in cells, replicates autonomously, and permits the persistent long-term expression of the neo gene under non-selective culture conditions. It also contains megabases of heterologous known DNA [lambda DNA in the exemplified embodiments] that serves as target sites for homologous recombination and integration of DNA of interest. The neo-minichromosome is, thus, a vector for genetic engineering of cells.

The methods herein provide means to induce the events that lead to formation of the neo-minichromosome by introducing heterologous DNA with a selective marker [preferably a dominant selectable marker] into cells and culturing the cells under selective conditions. As a result, cells that contain a multicentric, e.g., dicentric chromosome, or fragments thereof, generated by amplification are produced. Cells with the dicentric chromosome can then be treated to destabilize the chromosomes with agents, such as BrdU and/or culturing under selective conditions, resulting in cells in which the dicentric chromosome has formed two chromosomes, a so-called minichromosome, and a formerly dicentric chromosome that has typically undergone amplification in the heterochromatin where the heterologous DNA has integrated to produce a SATAC or a sausage chromosome [discussed below]. These cells can be fused with other cells to separate the minichromosome from the formerly dicentric chromosome into different cells so that each type of MAC can be manipulated separately.

3.6.5.3 Preparation of SATACs

To prepare a SATAC, the starting materials are cells, preferably a stable cell line, such as a fibroblast cell line, and a DNA fragment that includes DNA that encodes a selective marker. The DNA fragment is introduced into the cell by methods of DNA transfer, including but not limited to direct uptake using calcium phosphate, electroporation, and lipid-mediated transfer. To ensure integration of the DNA fragment in the heterochromatin, it is preferable to start with DNA that will be targeted to the pericentric heterochromatic region of the chromosome, such as lambda CM8 and vectors provided herein, such as pTEMPUD, that include satellite DNA. After introduction of the DNA, the cells are grown under selective conditions. The resulting cells are examined and any that have multicentric, particularly dicentric, chromosomes, or heterochromatic chromosomes or sausage chromosomes or other such structure are selected.

In particular, if a cell with a dicentric chromosome is selected, it can be grown under selective conditions, or, preferably, additional DNA encoding a second selectable marker is introduced, and the cells grown under conditions selective for the second marker. Cells with a structure, such as the sausage chromosome, can be selected and fused with a second cell line to eliminate other chromosomes that are not of interest. If desired, cells with other chromosomes can be selected and treated as described herein. If a cell with a sausage chromosome is selected, it can be treated with an agent, such as BrdU, that destabilizes the chromosome so that the heterochromatic arm forms a chromosome that is substantially heterochromatic [i.e., a megachromosome]. Structures such as the gigachromosome, in which the heterochromatic arm has amplified but not broken off from the euchromatic arm, will also be observed. The megachromosome is a stable chromosome. Further manipulation, such as fusions and growth in selective conditions and/or BrdU treatment or other such treatment, can lead to fragmentation of the megachromosome to form smaller chromosomes that have the amplicon as the basic repeating unit.

The megachromosome can be further fragmented in vivo using a chromosome fragmentation vector, such as pTEMPUD, to ultimately produce a chromosome that comprises a smaller stable replicable unit, about 15 Mb-60 Mb, containing one to four megareplicons.

Thus, the stable chromosomes formed de novo that originate from the short arm of mouse chromosome 7 have been analyzed. This chromosome region shows a capacity for amplification of large chromosome segments, and promotes de novo chromosome formation. Large-scale amplification at the same chromosome region leads to the formation of dicentric and multicentric chromosomes, a minichromosome, the 150-200 Mb size lambda neo-chromosome, the “sausage” chromosome, the 500-1000 Mb gigachromosome, and the stable 250-400 Mb megachromosome.

A clear segmentation is observed along the arms of the megachromosome, and analyses show that the building units of this chromosome are amplicons of ~30 Mb composed of mouse major satellite DNA with the integrated “foreign” DNA sequences at both ends. The ~30 Mb amplicons are composed of two ~15 Mb inverted doublets of ~7.5 Mb mouse major satellite DNA blocks, which are separated from each other by a narrow band of non-satellite sequences. The wider non-satellite regions at the amplicon borders contain integrated, exogenous [heterologous] DNA, while the narrow bands of non-satellite DNA sequences within the amplicons are integral parts of the pericentric heterochromatin of mouse chromosomes. These results indicate that the ~7.5 Mb blocks flanked by non-satellite DNA are the building units of the pericentric heterochromatin of mouse chromosomes, and the ~15 Mb size pericentric regions of mouse chromosomes contain two ~7.5 Mb units.
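The size relationships stated above can be checked with a short calculation. The following Python sketch is illustrative only: the function and constant names are hypothetical, the sizes are the approximate values given in the text, and the amplicon count is a rough estimate that assumes the narrow non-satellite bands and the euchromatic terminal segments contribute negligibly to the total length.

```python
# Sizes in megabases (Mb); approximate values taken from the description.
SATELLITE_BLOCK_MB = 7.5    # one block of mouse major satellite DNA
BLOCKS_PER_DOUBLET = 2      # an inverted doublet pairs two blocks
DOUBLETS_PER_AMPLICON = 2   # each amplicon holds two inverted doublets


def doublet_size_mb(block_mb=SATELLITE_BLOCK_MB):
    """Approximate size of one inverted doublet."""
    return BLOCKS_PER_DOUBLET * block_mb


def amplicon_size_mb(block_mb=SATELLITE_BLOCK_MB):
    """Approximate size of one amplicon (two inverted doublets)."""
    return DOUBLETS_PER_AMPLICON * doublet_size_mb(block_mb)


def amplicon_count_range(chromosome_mb_range, block_mb=SATELLITE_BLOCK_MB):
    """Rough (low, high) count of whole amplicons that fit in a size range.

    Simplifying assumption (not from the source): non-satellite bands and
    euchromatic terminal segments are ignored.
    """
    low, high = chromosome_mb_range
    size = amplicon_size_mb(block_mb)
    return int(low // size), int(high // size)


if __name__ == "__main__":
    print(doublet_size_mb())                 # doublet size in Mb
    print(amplicon_size_mb())                # amplicon size in Mb
    print(amplicon_count_range((250, 400)))  # megachromosome size range in Mb
```

Under these assumptions, a doublet is about 15 Mb, an amplicon about 30 Mb, and a chromosome in the 250-400 Mb range quoted for the megachromosome would hold on the order of 8 to 13 whole amplicons.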

Apart from the euchromatic terminal segments, the whole megachromosome is heterochromatic, and has structural homogeneity. Therefore, this large chromosome offers a unique possibility for obtaining information about the amplification process, and for analyzing some basic characteristics of the pericentric constitutive heterochromatin, as a vector for heterologous DNA, and as a target for further fragmentation.

As shown herein, this phenomenon is generalizable and can be observed with other chromosomes. Also, although these de novo formed chromosome segments and chromosomes appear different, there are similarities that indicate that a similar amplification mechanism plays a role in their formation: (i) in each case, the amplification is initiated in the centromeric region of the mouse chromosomes and large (Mb size) amplicons are formed; (ii) mouse major satellite DNA sequences are constant constituents of the amplicons, either by providing the bulk of the heterochromatic amplicons [H-type amplification], or by bordering the euchromatic amplicons [E-type amplification]; (iii) formation of inverted segments can be demonstrated in the lambda neo-chromosome and megachromosome; (iv) chromosome arms and chromosomes formed by the amplification are stable and functional.

The presence of inverted chromosome segments seems to be a common phenomenon in the chromosomes formed de novo at the centromeric region of mouse chromosome 7. During the formation of the neo-minichromosome, the event leading to the stabilization of the distal segment of mouse chromosome 7 that bears the neo-centromere may have been the formation of its inverted duplicate. Amplicons of the megachromosome are inverted doublets of ~7.5 Mb mouse major satellite DNA blocks.

3.6.5.4 Cell Lines

Cell lines that contain MACs, such as the minichromosome, the lambda neo-chromosome, and the SATACs, are provided herein or can be produced by the methods herein. Such cell lines provide a convenient source of these chromosomes and can be manipulated, such as by cell fusion or production of microcells for fusion with selected cell lines, to deliver the chromosome of interest into hybrid cell lines. Exemplary cell lines are described herein and some have been deposited with the ECACC.

3.6.5.4.1 EC3/7C5 and EC3/7C6

Cell lines EC3/7C5 and EC3/7C6 were produced by single cell cloning of EC3/7. For exemplary purposes EC3/7C5 has been deposited with the ECACC. These cell lines contain a minichromosome and the formerly dicentric chromosome from EC3/7. The stable minichromosomes in cell lines EC3/7C5 and EC3/7C6 appear to be the same and they seem to be duplicated derivatives of the 10-15 Mb “broken-off” fragment of the dicentric chromosome. Their similar size in these independently generated cell lines might indicate that about 20-30 Mb is the minimal or close to the minimal physical size for a stable minichromosome.

3.6.5.4.2 TF1004G19

Introduction of additional heterologous DNA, including DNA encoding a second selectable marker, hygromycin phosphotransferase, i.e., the hygromycin-resistance gene, and also a detectable marker, beta-galactosidase (i.e., encoded by the lacZ gene), into the EC3/7C5 cell line and growth under selective conditions produced cells designated TF1004G19. In particular, this cell line was produced from the EC3/7C5 cell line by cotransfection with plasmids pH132, which contains an anti-HIV ribozyme and hygromycin-resistance gene, pCH110 [encodes beta-galactosidase] and lambda phage [lambda c1875 Sam 7] DNA and selection with hygromycin B.

Detailed analysis of the TF1004G19 cell line by in situ hybridization with lambda phage and plasmid DNA sequences revealed the formation of the sausage chromosome. The formerly dicentric chromosome of the EC3/7C5 cell line translocated to the end of another acrocentric chromosome. The heterologous DNA integrated into the pericentric heterochromatin of the formerly dicentric chromosome and is amplified several times with megabases of mouse pericentric heterochromatic satellite DNA sequences forming the “sausage” chromosome. Subsequently the acrocentric mouse chromosome was substituted by a euchromatic telomere.

In situ hybridization with biotin-labeled subfragments of the hygromycin-resistance and beta-galactosidase genes resulted in a hybridization signal only in the heterochromatic arm of the sausage chromosome, indicating that in TF1004G19 transformant cells these genes are localized in the pericentric heterochromatin. A high level of gene expression, however, was detected.

In general, heterochromatin has a silencing effect in Drosophila, yeast and on the HSV-tk gene introduced into satellite DNA at the mouse centromere. Thus, it was of interest to study the TF1004G19 transformed cell line to confirm that genes located in the heterochromatin were indeed expressed, contrary to recognized dogma.

For this purpose, subclones of TF1004G19, containing a different sausage chromosome, were established by single cell cloning. Southern hybridization of DNA isolated from the subclones with subfragments of hygromycin phosphotransferase and lacZ genes showed a close correlation between the intensity of hybridization and the length of the sausage chromosome. This finding supports the conclusion that these genes are localized in the heterochromatic arm of the sausage chromosome.

3.6.5.4.2.1 TF1004G-19C5

TF1004G-19C5 is a mouse LMTK⁻ fibroblast cell line containing neo-minichromosomes and stable “sausage” chromosomes. It is a subclone of TF1004G19 and was generated by single-cell cloning of the TF1004G19 cell line. It has been deposited with the ECACC as an exemplary cell line and exemplary source of a sausage chromosome. Subsequent fusion of this cell line with CHO K20 cells and selection with hygromycin and G418 and HAT (hypoxanthine, aminopterin, and thymidine medium, see Szybalski et al. (1962) Proc. Natl. Acad. Sci. 48: 2026) resulted in hybrid cells (designated 19C5xHa4) that carry the sausage chromosome and/or the neo-minichromosome. BrdU treatment of the hybrid cells, followed by single cell cloning and selection with G418 and/or hygromycin produced various cells that carry chromosomes of interest, including GB43 and G3D5.

3.6.5.4.2.2 Other Subclones

Cell lines GB43 and G3D5 were obtained by treating 19C5xHa4 cells with BrdU followed by growth in G418-containing selective medium and retreatment with BrdU. The two cell lines were isolated by single cell cloning of the selected cells. GB43 cells contain the neo-minichromosome only. G3D5, which has been deposited with the ECACC, carries the neo-minichromosome and the megachromosome. Single cell cloning of this cell line followed by growth of the subclones in G418- and hygromycin-containing medium yielded subclones such as the GHB42 cell line carrying the neo-minichromosome and the megachromosome. H1D3 is a mouse-hamster hybrid cell line carrying the megachromosome, but no neo-minichromosome, and was generated by treating 19C5xHa4 cells with BrdU followed by growth in hygromycin-containing selective medium and single cell subcloning of selected cells. Fusion of this cell line with the CD4⁺ HeLa cell line that also carries DNA encoding an additional selection gene, the neomycin-resistance gene, produced cells [designated H1xHE41 cells] that carry the megachromosome as well as a human chromosome that carries CD4neo [H1D3 cells]. Further BrdU treatment and single cell cloning produced cell lines, such as 1B3, that include cells with a truncated megachromosome.

3.6.5.5 DNA Constructs Used to Transform the Cells

Heterologous DNA can be introduced into the cells by transfection or other suitable method at any stage during preparation of the chromosomes. In general, incorporation of such DNA into the MACs is assured through site-directed integration, such as may be accomplished by inclusion of lambda-DNA in the heterologous DNA (for the exemplified chromosomes), and also an additional selective marker gene. For example, cells containing a MAC, such as the minichromosome or a SATAC, can be cotransfected with a plasmid carrying the desired heterologous DNA, such as DNA encoding an HIV ribozyme, the cystic fibrosis gene, and DNA encoding a second selectable marker, such as hygromycin resistance. Selective pressure is then applied to the cells by exposing them to an agent that is harmful to cells that do not express the new selectable marker. In this manner, cells that include the heterologous DNA in the MAC are identified. Fusion with a second cell line can provide a means to produce cell lines that contain one particular type of chromosomal structure or MAC.

Various vectors for this purpose can be readily constructed. The vectors preferably include DNA that is homologous to DNA contained within a MAC in order to target the DNA to the MAC for integration therein. The vectors also include a selectable marker gene and the selected heterologous gene(s) of interest. Based on the disclosure herein and the knowledge of the skilled artisan, one of skill can construct such vectors.

Of particular interest herein is the vector pTEMPUD and derivatives thereof that can target DNA into the heterochromatic region of selected chromosomes. These vectors can also serve as fragmentation vectors.

Heterologous genes of interest include any gene that encodes a therapeutic product and DNA encoding gene products of interest. These genes and DNA include, but are not limited to: the cystic fibrosis gene [CF], the cystic fibrosis transmembrane regulator (CFTR) gene [see, e.g., U.S. Pat. No. 5,240,846; Rosenfeld et al. (1992) Cell 68: 143-155; Hyde et al. (1993) Nature 362: 250-255; Kerem et al. (1989) Science 245: 1073-1080; Riordan et al. (1989) Science 245: 1066-1072; Rommens et al. (1989) Science 245: 1059-1065; Osborne et al. (1991) Am. J. Hum. Genetics 48: 6089-6122; White et al. (1990) Nature 344: 665-667; Dean et al. (1990) Cell 61: 863-870; Erlich et al. (1991) Science 252: 1643; and U.S. Pat. Nos. 5,453,357, 5,449,604, 5,434,086, and 5,240,846, which provides a retroviral vector encoding the normal CFTR gene].

3.6.6 Isolation of Artificial Chromosomes

The MACs provided herein can be isolated by any suitable method known to those of skill in the art. Also, a method is provided herein for effecting substantial purification, particularly of the SATACs. SATACs have been isolated by fluorescence-activated cell sorting [FACS]. This method takes advantage of the nucleotide base content of the SATACs, which, by virtue of their heterochromatic DNA content, will differ from any other chromosomes in a cell. In particular, metaphase chromosomes are isolated (e.g., by addition of colchicine) and stained with base-specific dyes, such as Hoechst 33258 and chromomycin A3. Fluorescence-activated cell sorting will separate the SATACs from the genomic chromosomes. A dual-laser cell sorter [FACStar Plus and FACStar Vantage, Becton Dickinson Immunocytometry System], in which two lasers were set to excite the dyes separately, allowed a bivariate analysis of the chromosomes by base-pair composition and size. Cells containing such SATACs can be similarly sorted.
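The bivariate analysis described above can be pictured as selecting events that fall inside a region of a two-dimensional fluorescence plot. The following Python sketch is a minimal illustration and is not the sorter's software: the class names, the rectangular gate shape, and any gate bounds a user supplies are hypothetical, and in practice gate positions would be set empirically from the observed chromosome populations. It relies only on the well-established dye preferences, with Hoechst 33258 binding preferentially to AT-rich DNA and chromomycin A3 to GC-rich DNA.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChromosomeEvent:
    """One detected chromosome with its two dye signals (arbitrary units)."""
    hoechst: float      # Hoechst 33258 signal (AT-preferring dye)
    chromomycin: float  # chromomycin A3 signal (GC-preferring dye)


@dataclass(frozen=True)
class RectGate:
    """Hypothetical rectangular sorting gate on the bivariate plot."""
    hoechst_min: float
    hoechst_max: float
    chromomycin_min: float
    chromomycin_max: float

    def contains(self, event):
        """Return True when both dye signals lie within the gate bounds."""
        return (self.hoechst_min <= event.hoechst <= self.hoechst_max
                and self.chromomycin_min <= event.chromomycin <= self.chromomycin_max)


def sort_events(events, gate):
    """Split events into (inside_gate, outside_gate) lists, preserving order."""
    inside, outside = [], []
    for event in events:
        (inside if gate.contains(event) else outside).append(event)
    return inside, outside
```

A chromosome whose base composition differs markedly from that of the genomic chromosomes would be expected to occupy a distinct position on such a plot, which is what allows a gate to be drawn around it.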

3.6.7 Introduction of Artificial Chromosomes into Cells, Tissues, Animals and Plants

Suitable hosts for introduction of the MACs provided herein include, but are not limited to, any animal or plant, cell or tissue thereof, including, but not limited to: mammals, birds, reptiles, amphibians, insects, fish, arachnids, tobacco, tomato, wheat, plants and algae. The MACs, if contained in cells, may be introduced by cell fusion or microcell fusion or, if the MACs have been isolated from cells, they may be introduced into host cells by any method known to those of skill in this art, including but not limited to: direct DNA transfer, electroporation, lipid-mediated transfer, e.g., lipofection and liposomes, microprojectile bombardment, microinjection in cells and embryos, protoplast regeneration for plants, and any other suitable method [see, e.g., Weissbach et al. (1988) Methods for Plant Molecular Biology, Academic Press, N.Y., Section VIII, pp. 421-463; Grierson et al. (1988) Plant Molecular Biology, 2d Ed., Blackie, London, Ch. 7-9; see, also U.S. Pat. Nos. 5,491,075; 5,482,928; and 5,424,409; see, also, e.g., U.S. Pat. No. 5,470,708, which describes particle-mediated transformation of mammalian unattached cells].

Other methods for introducing DNA into cells include nuclear microinjection, electroporation, and bacterial protoplast fusion with intact cells. Polycations, such as polybrene and polyornithine, may also be used. For various techniques for transforming mammalian cells, see, e.g., Keown et al. Methods in Enzymology (1990) Vol. 185, pp. 527-537; and Mansour et al. (1988) Nature 336: 348-352.

DNA may be introduced by direct DNA transformation; microinjection in cells or embryos, protoplast regeneration for plants, electroporation, microprojectile gun and other such methods [see, e.g., Weissbach et al. (1988) Methods for Plant Molecular Biology, Academic Press, N.Y., Section VIII, pp. 421-463; Grierson et al. (1988) Plant Molecular Biology, 2d Ed., Blackie, London, Ch. 7-9; see, also U.S. Pat. Nos. 5,491,075; 5,482,928; and 5,424,409; see, also, e.g., U.S. Pat. No. 5,470,708, which describes particle-mediated transformation of mammalian unattached cells].

For example, isolated, purified artificial chromosomes can be injected into an embryonic cell line such as a human kidney primary embryonic cell line [ATCC accession number CRL 1573] or embryonic stem cells [see, e.g., Hogan et al. (1994) Manipulating the Mouse Embryo: A Laboratory Manual, Cold Spring Harbor Laboratory Press, Cold Spring Harbor, N.Y., see, especially, pages 255-264 and Appendix 3]. Preferably the chromosomes are introduced by microinjection, using a system such as the Eppendorf automated microinjection system, and grown under selective conditions, such as in the presence of hygromycin B or neomycin.

3.6.7.1. Methods for Introduction of Chromosomes into Hosts

Depending on the host cell used, transformation is done using standard techniques appropriate to such cells. These methods include any known to those of skill in the art, including those described herein.

3.6.7.1.1 DNA Uptake

For mammalian cells that do not have cell walls, the calcium phosphate precipitation method for introduction of exogenous DNA [see, e.g., Graham et al. (1978) Virology 52: 456-457; Wigler et al. (1979) Proc. Natl. Acad. Sci. U.S.A. 76: 1373-1376; and Current Protocols in Molecular Biology, Vol. 1, Wiley Inter-Science, Supplement 14, Unit 9.1.1-9.1.9 (1990)] is often preferred. DNA uptake can be accomplished by DNA alone or in the presence of polyethylene glycol [PEG-mediated gene transfer], which is a fusion agent, or by any variations of such methods known to those of skill in the art [see, e.g., U.S. Pat. No. 4,684,611].

Lipid-mediated carrier systems are also among the preferred methods for introduction of DNA into cells [see, e.g., Teifel et al. (1995) Biotechniques 19: 79-80; Albrecht et al. (1996) Ann. Hematol. 72: 73-79; Holmen et al. (1995) In Vitro Cell Dev. Biol. Anim. 31: 347-351; Remy et al. (1994) Bioconjug. Chem. 5: 647-654; Le Bolc'h et al. (1995) Tetrahedron Lett. 36: 6681-6684; Loeffler et al. (1993) Meth. Enzymol. 217: 599-618]. Lipofection [see, e.g., Strauss (1996) Meth. Mol. Biol. 54: 307-327] may also be used to introduce DNA into cells. This method is particularly well-suited for transfer of exogenous DNA into chicken cells (e.g., chicken blastodermal cells and primary chicken fibroblasts; see Brazolot et al. (1991) Mol. Repro. Dev. 30: 304-312). In particular, DNA of interest can be introduced into chickens in operative linkage with promoters from genes, such as lysozyme and ovalbumin, that are expressed in the egg, thereby permitting expression of the heterologous DNA in the egg.

Additional methods useful in the direct transfer of DNA into cells include particle gun electrofusion [see, e.g., U.S. Pat. Nos. 4,955,378, 4,923,814, 4,476,004, 4,906,576 and 4,441,972] and virion-mediated gene transfer.

A commonly used approach for gene transfer in land plants involves the direct introduction of purified DNA into protoplasts. The three basic methods for direct gene transfer into plant cells include: 1) polyethylene glycol [PEG]-mediated DNA uptake, 2) electroporation-mediated DNA uptake and 3) microinjection. In addition, plants may be transformed using ultrasound treatment [see, e.g., International PCT application publication No. WO 91/00358].

3.6.7.1.2 Electroporation

Electroporation involves providing high-voltage electrical pulses to a solution containing a mixture of protoplasts and foreign DNA to create reversible pores in the membranes of plant protoplasts as well as other cells. Electroporation is generally used for prokaryotes or other cells, such as plants, that contain substantial cell-wall barriers. Methods for effecting electroporation are well known [see, e.g., U.S. Pat. Nos. 4,784,737, 5,501,967, 5,501,662, 5,019,034, 5,503,999; see, also Fromm et al. (1985) Proc. Natl. Acad. Sci. U.S.A. 82: 5824-5828].

For example, electroporation is often used for transformation of plants [see, e.g., Ag Biotechnology News 7: 3 and 17 (September/October 1990)]. In this technique, plant protoplasts are electroporated in the presence of the DNA of interest that also includes a phenotypic marker. Electrical impulses of high field strength reversibly permeabilize biomembranes, allowing the introduction of the plasmids. Electroporated plant protoplasts reform the cell wall, divide, and form plant callus. Transformed plant cells will be identified by virtue of the expressed phenotypic marker. The exogenous DNA may be added to the protoplasts in any form such as, for example, naked linear, circular or supercoiled DNA, DNA encapsulated in liposomes, DNA in spheroplasts, DNA in other plant protoplasts, DNA complexed with salts, and other methods.

3.6.7.1.3 Microcells

The chromosomes can be transferred by preparing microcells containing an artificial chromosome and then fusing with selected target cells. Methods for such preparation and fusion of microcells are well known [see, e.g., U.S. Pat. Nos. 5,240,840, 4,806,476, 5,298,429, 5,396,767, Fournier (1981) Proc. Natl. Acad. Sci. U.S.A. 78: 6349-6353; and Lambert et al. (1991) Proc. Natl. Acad. Sci. U.S.A. 88: 5907-59]. Microcell fusion, using microcells that contain an artificial chromosome, is a particularly useful method for introduction of MACs into avian cells, such as DT40 chicken pre-B cells [for a description of DT40 cell fusion, see, e.g., Dieken et al. (1996) Nature Genet. 12: 174-182].

3.6.7.2. Hosts

Suitable hosts include any host known to be useful for introduction and expression of heterologous DNA. Of particular interest herein are animal and plant cells and tissues, including, but not limited to, insect cells and larvae, plants, and animals, particularly transgenic animals, and animal cells. Other hosts include, but are not limited to, mammals, birds, particularly fowl such as chickens, reptiles, amphibians, insects, fish, arachnids, tobacco, tomato, wheat, monocots, dicots and algae, and any host into which introduction of heterologous DNA is desired. Such introduction can be effected using the MACs provided herein, or, if necessary, by using the MACs provided herein to identify species-specific centromeres and/or functional chromosomal units and then using the resulting centromeres or chromosomal units as artificial chromosomes, or alternatively, using the methods exemplified herein for production of MACs to produce species-specific artificial chromosomes.

3.6.7.2.1 Introduction of DNA into Embryos for Production of Transgenic Animals and Introduction of DNA into Animal Cells

Transgenic animals can be produced by introducing exogenous genetic material into a pronucleus of a mammalian zygote by microinjection [see, e.g., U.S. Pat. Nos. 4,873,191 and 5,354,674; see, also, International PCT application publication No. WO95/14769, which is based on U.S. application Ser. No. 08/159,084]. The zygote is capable of development into a mammal. The embryo or zygote is transplanted into a host female uterus and allowed to develop. Detailed protocols and examples are set forth below.

Transgenic chickens can be produced by injection of dispersed blastodermal cells from Stage X chicken embryos into recipient embryos at a similar stage of development [see, e.g., Etches et al. (1993) Poultry Sci. 72: 882-889; Petitte et al. (1990) Development 108: 185-189]. Heterologous DNA is first introduced into the donor blastodermal cells using methods such as, for example, lipofection [see, e.g., Brazolot et al. (1991) Mol. Repro. Dev. 30: 304-312] or microcell fusion [see, e.g., Dieken et al. (1996) Nature Genet. 12: 174-182]. The transfected donor cells are then injected into recipient chicken embryos [see, e.g., Carsience et al. (1993) Development 117: 669-675]. The recipient chicken embryos within the shell are candled and allowed to hatch to yield a germline chimeric chicken.

DNA can be introduced into animal cells using any known procedure, including, but not limited to: direct uptake, incubation with polyethylene glycol [PEG], microinjection, electroporation, lipofection, cell fusion, microcell fusion, particle bombardment, including microprojectile bombardment [see, e.g., U.S. Pat. No. 5,470,708, which provides a method for transforming unattached mammalian cells via particle bombardment], and any other such method. For example, the transfer of plasmid DNA in liposomes directly to human cells in situ has been approved by the FDA for use in humans [see, e.g., Nabel et al. (1990) Science 249: 1285-1288 and U.S. Pat. No. 5,461,032].

3.6.7.2.2 Introduction of Heterologous DNA into Plants

Numerous methods for producing or developing transgenic plants are available to those of skill in the art. The method used is primarily a function of the species of plant. These methods include, but are not limited to: direct transfer of DNA by processes, such as PEG-induced DNA uptake, protoplast fusion, microinjection, electroporation, and microprojectile bombardment [see, e.g., Uchimiya et al. (1989) J. of Biotech. 12: 1-20 for a review of such procedures, see, also, e.g., U.S. Pat. Nos. 5,436,392 and 5,489,520 and many others]. For purposes herein, when introducing a MAC, microinjection, protoplast fusion and particle gun bombardment are preferred.

Plant species, including tobacco, rice, maize, rye, soybean, Brassica napus, cotton, lettuce, potato and tomato, have been used to produce transgenic plants. Tobacco and other species, such as petunias, often serve as experimental models in which the methods have been developed and the genes first introduced and expressed.

DNA uptake can be accomplished by DNA alone or in the presence of PEG, which is a fusion agent, with plant protoplasts or by any variations of such methods known to those of skill in the art [see, e.g., U.S. Pat. No. 4,684,611 to Schilperoort et al.]. Electroporation, which involves applying high-voltage electrical pulses to a solution containing a mixture of protoplasts and foreign DNA to create reversible pores, has been used, for example, to successfully introduce foreign genes into rice and Brassica napus. Microinjection of DNA into plant cells, including cultured cells and cells in intact plant organs and embryoids in tissue culture, and microprojectile bombardment [acceleration of small high density particles, which contain the DNA, to high velocity with a particle gun apparatus, which forces the particles to penetrate plant cell walls and membranes] have also been used. All plant cells into which DNA can be introduced and that can be regenerated from the transformed cells can be used to produce transformed whole plants which contain the transferred artificial chromosome. The particular protocol and means for introduction of the DNA into the plant host may need to be adapted or refined to suit the particular plant species or cultivar.

3.6.7.2.3 Insect Cells

Insects are useful hosts for introduction of artificial chromosomes for numerous reasons, including, but not limited to: (a) amplification of genes encoding useful proteins can be accomplished in the artificial chromosome to obtain higher protein yields in insect cells; (b) insect cells support required post-translational modifications, such as glycosylation and phosphorylation, that can be required for protein biological functioning; (c) insect cells do not support mammalian viruses, and, thus, eliminate the problem of cross-contamination of products with such infectious agents; (d) this technology circumvents traditional recombinant baculovirus systems for production of nutritional, industrial or medicinal proteins in insect cell systems; (e) the low temperature optimum for insect cell growth (28° C.) permits reduced energy cost of production; (f) serum-free growth medium for insect cells permits lower production costs; (g) artificial chromosome-containing cells can be stored indefinitely at low temperature; and (h) insect larvae will be biological factories for production of nutritional, medicinal or industrial proteins by microinjection of fertilized insect eggs [see, e.g., Joy et al. (1991) Current Science 66: 145-150, which provides a method for microinjecting heterologous DNA into Bombyx mori eggs].

Either MACs or insect-specific artificial chromosomes [BUGACs] will be used to introduce genes into insects. It appears that MACs will function in insects to direct expression of heterologous DNA contained thereon. For example, a MAC containing the B. mori actin gene promoter fused to the lacZ gene has been generated by transfection of EC3/7C5 cells with a plasmid containing the fusion gene. Subsequent fusion of the B. mori cells with the transfected EC3/7C5 cells that survived selection yielded a MAC-containing insect-mouse hybrid cell line in which beta-galactosidase expression was detectable.

Insect host cells include, but are not limited to, hosts such as Spodoptera frugiperda [caterpillar], Aedes aegypti [mosquito], Aedes albopictus [mosquito], Drosophila melanogaster [fruitfly], Bombyx mori [silkworm], Manduca sexta [tomato hornworm] and Trichoplusia ni [cabbage looper]. Efforts have been directed toward propagation of insect cells in culture. Such efforts have focused on the fall armyworm, Spodoptera frugiperda. Cell lines have been developed also from other insects such as the cabbage looper, Trichoplusia ni and the silkworm, Bombyx mori. It has also been suggested that analogous cell lines can be created using the tomato hornworm, Manduca sexta. To introduce DNA into an insect, it should be introduced into the larvae, and allowed to proliferate, and then the hemolymph recovered from the larvae so that the proteins can be isolated therefrom.

The preferred method herein for introduction of artificial chromosomes into insect cells is microinjection [see, e.g., Tamura et al. (1991) Bio Ind. 8: 26-31; Nikolaev et al. (1989) Mol. Biol. (Moscow) 23: 1177-87; and methods exemplified and discussed herein].

3.6.8 Applications for and Uses of Artificial Chromosomes

Artificial chromosomes provide convenient and useful vectors, and in some instances [e.g., in the case of very large heterologous genes] the only vectors, for introduction of heterologous genes into hosts. Virtually any gene of interest is amenable to introduction into a host via artificial chromosomes. Such genes include, but are not limited to, genes that encode receptors, cytokines, enzymes, proteases, hormones, growth factors, antibodies, tumor suppressor genes, therapeutic products and multigene pathways.

The artificial chromosomes provided herein will be used in methods of protein and gene product production, particularly using insects as host cells for production of such products, and in cellular (e.g., mammalian cell) production systems in which the artificial chromosomes (particularly MACs) provide a reliable, stable and efficient means for optimizing the biomanufacturing of important compounds for medicine and industry. They are also intended for use in methods of gene therapy, and for production of transgenic plants and animals [discussed above and below].

3.6.8.1 Gene Therapy

Any nucleic acid encoding a therapeutic gene product or product of a multigene pathway may be introduced into a host animal, such as a human, or into a target cell line for introduction into an animal, for therapeutic purposes. Such therapeutic purposes include genetic therapy to cure or to provide gene products that are missing or defective, to deliver agents, such as anti-tumor agents, to targeted cells or to an animal, and to provide gene products that will confer resistance or reduce susceptibility to a pathogen or ameliorate symptoms of a disease or disorder. The following are some exemplary genes and gene products. Such exemplification is not intended to be limiting.

3.6.8.1.1 Anti-HIV Ribozymes

As exemplified below, DNA encoding anti-HIV ribozymes can be introduced and expressed in cells using MACs, including the euchromatin-based minichromosomes and the SATACs. These MACs can be used to make a transgenic mouse that expresses a ribozyme and, thus, serves as a model for testing the activity of such ribozymes or from which ribozyme-producing cell lines can be made. Also, introduction of a MAC that encodes an anti-HIV ribozyme into human cells will serve as treatment for HIV infection. Such systems further demonstrate the viability of using any disease-specific ribozyme to treat or ameliorate a particular disease.

3.6.8.1.2 Tumor Suppressor Genes

Tumor suppressor genes are genes that, in their wild-type alleles,express proteins that suppress abnormal cellular proliferation. When thegene coding for a tumor suppressor protein is mutated or deleted, theresulting mutant protein or the complete lack of tumor suppressorprotein expression may result in a failure to correctly regulatecellular proliferation. Consequently, abnormal cellular proliferationmay take place, particularly if there is already existing damage to thecellular regulatory mechanism. A number of well-studied human tumors andtumor cell lines have been shown to have missing or nonfunctional tumorsuppressor genes.

Examples of tumor suppressor genes include, but are not limited to, the retinoblastoma susceptibility gene or RB gene, the p53 gene, the gene that is deleted in colon carcinoma [i.e., the DCC gene] and the neurofibromatosis type I [NF-1] tumor suppressor gene [see, e.g., U.S. Pat. No. 5,496,731; Weinberg et al. (1991) Science 254: 1138-1146]. Loss of function or inactivation of tumor suppressor genes may play a central role in the initiation and/or progression of a significant number of human cancers.

3.6.8.1.2.1 The p53 Gene

Somatic cell mutations of the p53 gene are said to be the most frequent of the gene mutations associated with human cancer [see, e.g., Weinberg et al. (1991) Science 254: 1138-1146]. The normal or wild-type p53 gene is a negative regulator of cell growth, which, when damaged, favors cell transformation. The p53 expression product is found in the nucleus, where it may act in parallel or cooperatively with other gene products. Tumor cell lines in which p53 has been deleted have been successfully treated with wild-type p53 vector to reduce tumorigenicity [see, Baker et al. (1990) Science 249: 912-915].

DNA encoding the p53 gene and plasmids containing this DNA are wellknown [see, e.g., U.S. Pat. No. 5,260,191; see, also Chen et al. (1990)Science 250: 1576; Farrel et al. (1991) EMBO J. 10: 2879-2887; plasmidscontaining the gene are available from the ATCC, and the sequence is inthe GenBank Database, accession nos. X54156, X60020, M14695, M16494,K03199].

3.6.8.1.3 The CFTR Gene

Cystic fibrosis [CF] is an autosomal recessive disease that affectsepithelia of the airways, sweat glands, pancreas, and other organs. Itis a lethal genetic disease associated with a defect in chloride iontransport, and is caused by mutations in the gene coding for the cysticfibrosis transmembrane conductance regulator [CFTR], a 1480 amino acidprotein that has been associated with the expression of chlorideconductance in a variety of eukaryotic cell types. Defects in CFTRdestroy or reduce the ability of epithelial cells in the airways, sweatglands, pancreas and other tissues to transport chloride ions inresponse to cAMP-mediated agonists and impair activation of apicalmembrane channels by cAMP-dependent protein kinase A [PKA]. Given thehigh incidence and devastating nature of this disease, development ofeffective CF treatments is imperative.

The CFTR gene [˜250 kb] [600 kb] can be transferred into a MAC for use,for example, in gene therapy as follows. A CF-YAC [see Green et al.Science 250: 94-98] may be modified to include a selectable marker, suchas a gene encoding a protein that confers resistance to puromycin orhygromycin, and lambda-DNA for use in site-specific integration into aneo-minichromosome or a SATAC. Such a modified CF-YAC can be introducedinto MAC-containing cells, such as EC3/7C5 or 19C5xHa4 cells, by fusionwith yeast protoplasts harboring the modified CF-YAC or microinjectionof yeast nuclei harboring the modified CF-YAC into the cells. Stabletransformants are then selected on the basis of antibiotic resistance.These transformants will carry the modified CF-YAC within the MACcontained in the cells.

3.6.8.2 Animals, Birds, Fish and Plants that are Genetically Altered toPossess Desired Traits such as Resistance to Disease

Artificial chromosomes are ideally suited for preparing animals,including vertebrates and invertebrates, including birds and fish aswell as mammals, that possess certain desired traits, such as, forexample, disease resistance, resistance to harsh environmentalconditions, altered growth patterns, and enhanced physicalcharacteristics.

One example of the use of artificial chromosomes in generatingdisease-resistant organisms involves the preparation of multivalentvaccines. Such vaccines include genes encoding multiple antigens thatcan be carried in a MAC, or species-specific artificial chromosome, andeither delivered to a host to induce immunity, or incorporated intoembryos to produce transgenic animals and plants that are immune or lesssusceptible to certain diseases.

Disease-resistant animals and plants may also be prepared in whichresistance or decreased susceptibility to disease is conferred byintroduction into the host organism or embryo of artificial chromosomescontaining DNA encoding gene products (e.g., ribozymes and proteins thatare toxic to certain pathogens) that destroy or attenuate pathogens orlimit access of pathogens to the host.

Animals and plants possessing desired traits that might, for example, enhance utility, processibility and commercial value of the organisms in areas such as the agricultural and ornamental plant industries may also be generated using artificial chromosomes in the same manner as described above for production of disease-resistant animals and plants. In such instances, the artificial chromosomes that are introduced into the organism or embryo contain DNA encoding gene products that serve to confer the desired trait in the organism.

Birds, particularly fowl such as chickens, fish and crustaceans willserve as model hosts for production of genetically altered organismsusing artificial chromosomes.

3.6.8.3 Use of MACs and Other Artificial Chromosomes for Preparation and Screening of Libraries

Since large fragments of DNA can be incorporated into each artificial chromosome, the chromosomes are well-suited for use as cloning vehicles that can accommodate entire genomes in the preparation of genomic DNA libraries, which then can be readily screened. For example, MACs may be used to prepare a genomic DNA library useful in the identification and isolation of functional centromeric DNA from different species of organisms. In such applications, the MAC used to prepare a genomic DNA library from a particular organism is one that is not functional in cells of that organism. That is, the MAC does not stably replicate, segregate or provide for expression of genes contained within it in cells of the organism. Preferably, the MACs contain an indicator gene (e.g., the lacZ gene encoding beta-galactosidase or genes encoding products that confer resistance to antibiotics such as neomycin, puromycin, hygromycin) linked to a promoter that is capable of promoting transcription of the indicator gene in cells of the organism. Fragments of genomic DNA from the organism are incorporated into the MACs, and the MACs are transferred to cells from the organism. Cells that contain MACs that have incorporated functional centromeres contained within the genomic DNA fragments are identified by detection of expression of the marker gene. As a further example, DNA encoding tree growth factors can be introduced into trees: libraries can be prepared, large fragments can be introduced into the chromosomes, and the chromosomes can all be introduced into trees, thereby ensuring expression.

3.6.8.4 Use of MACs and Other Artificial Chromosomes for Stable,High-Level Protein Production

Cells containing the MACs and/or other artificial chromosomes providedherein are advantageously used for production of proteins, particularlyseveral proteins from one cell line, such as multiple proteins involvedin a biochemical pathway or multivalent vaccines. The genes encoding theproteins are introduced into the artificial chromosomes which are thenintroduced into cells. Alternatively, the heterologous gene(s) ofinterest are transferred into a production cell line that alreadycontains artificial chromosomes in a manner that targets the gene(s) tothe artificial chromosomes. The cells are cultured under conditionswhereby the heterologous proteins are expressed. Because the proteinswill be expressed at high levels in a stable permanent extra-genomicchromosomal system, selective conditions are not required.

Any transfectable cells capable of serving as recombinant hostsadaptable to continuous propagation in a cell culture system [see, e.g.,McLean (1993) Trends In Biotech. 11: 232-238] are suitable for use in anartificial chromosome-based protein production system. Exemplary hostcell lines include, but are not limited to, the following: Chinesehamster ovary (CHO) cells [see, e.g., Zang et al. (1995) Biotechnology13: 389-392], HEK 293, Ltk−, COS-7, DG44, and BHK cells. CHO cells areparticularly preferred host cells. Selection of host cell lines for usein artificial chromosome-based protein production systems is within theskill of the art, but often will depend on a variety of factors,including the properties of the heterologous protein to be produced,potential toxicity of the protein in the host cell, any requirements forpost-translational modification (e.g., glycosylation, amination,phosphorylation) of the protein, transcription factors available in thecells, the type of promoter element(s) being used to drive expression ofthe heterologous gene, whether production will be completelyintracellular or the heterologous protein will preferably be secretedfrom the cell, and the types of processing enzymes in the cell.

The artificial chromosome-based system for heterologous protein production has many advantageous features. For example, as described above, because the heterologous DNA is located in an independent, extra-genomic artificial chromosome (as opposed to randomly inserted in an unknown area of the host cell genome or located as extrachromosomal element(s) providing only transient expression) it is stably maintained in an active transcription unit and is not subject to ejection via recombination or elimination during cell division. Accordingly, it is unnecessary to include a selection gene in the host cells and thus growth under selective conditions is also unnecessary. Furthermore, because the artificial chromosomes are capable of incorporating large segments of DNA, multiple copies of the heterologous gene and linked promoter element(s) can be retained in the chromosomes, thereby providing for high-level expression of the foreign protein(s). Alternatively, multiple copies of the gene can be linked to a single promoter element and several different genes may be linked in a fused polygene complex to a single promoter for expression of, for example, all the key proteins constituting a complete metabolic pathway [see, e.g., Beck von Bodman et al. (1995) Biotechnology 13: 587-591]. Alternatively, multiple copies of a single gene can be operatively linked to a single promoter, or each or one or several copies may be linked to different promoters or multiple copies of the same promoter. Additionally, because artificial chromosomes have an almost unlimited capacity for integration and expression of foreign genes, they can be used not only for the expression of genes encoding end-products of interest, but also for the expression of genes associated with optimal maintenance and metabolic management of the host cell, e.g., genes encoding growth factors, as well as genes that may facilitate rapid synthesis of the correct form of the desired heterologous protein product, e.g., genes encoding processing enzymes and transcription factors.

The MACs are suitable for expression of any proteins or peptides, including proteins and peptides that require in vivo posttranslational modification for their biological activity. Such proteins include, but are not limited to, antibody fragments, full-length antibodies, and multimeric antibodies, tumor suppressor proteins, naturally occurring or artificial antibodies and enzymes, heat shock proteins, and others.

Thus, such cell-based “protein factories” employing MACs can be generated using MACs constructed with multiple copies [theoretically an unlimited number or at least up to a number such that the resulting MAC is about up to the size of a genomic chromosome] of protein-encoding genes with appropriate promoters, or multiple genes driven by a single promoter, i.e., a fused gene complex [such as a complete metabolic pathway in a plant expression system; see, e.g., Beck von Bodman (1995) Biotechnology 13: 587-591]. Once such a MAC is constructed, it can be transferred to a suitable cell culture system, such as a CHO cell line in protein-free culture medium [see, e.g., (1995) Biotechnology 13: 389-39] or other immortalized cell lines [see, e.g., (1993) TIBTECH 11: 232-238] where continuous production can be established.

The ability of MACs to provide for high-level expression of heterologousproteins in host cells is demonstrated, for example, by analysis of theH1D3 and G3D5 cell lines described herein and deposited with the ECACC.Northern blot analysis of mRNA obtained from these cells reveals thatexpression of the hygromycin-resistance and beta-galactosidase genes inthe cells correlates with the amplicon number of the megachromosome(s)contained therein.

4. ENGINEERING APPROACHES

4.1.1 General Considerations

In one aspect, this invention applies the technical field of moleculargenetics to evolve the genomes of cells and organisms to acquire new andimproved properties.

Cells have a number of well-established uses in molecular biology. Forexample, cells are commonly used as hosts for manipulating DNA inprocesses such as transformation and recombination. Cells are also usedfor expression of recombinant proteins encoded by DNA transformed intothe cells. Some types of cells are also used as progenitors forgeneration of transgenic animals and plants. Although all of theseprocesses are now routine, in general, the genomes of the cells used inthese processes have evolved little from the genomes of natural cells,and particularly not toward acquisition of new or improved propertiesfor use in the above processes.

The traditional approach to artificial or forced molecular evolutionfocuses on optimization of an individual gene having a discrete andselectable phenotype. The strategy is to clone a gene, identify adiscrete function for the gene and an assay by which it can be selected,mutate selected positions in the gene (e.g., by error-prone PCR orcassette mutagenesis) and select variants of the gene for improvement inthe known function of the gene. A variant having improved function canthen be expressed in a desired cell type. This approach has a number oflimitations. First, it is only applicable to genes that have beenisolated and functionally characterized. Second, the approach is usuallyonly applicable to genes that have a discrete function. In other words,multiple genes that cooperatively confer a single phenotype cannotusually be optimized in this manner. Probably, most genes do have

explore a very limited number of the total number of permutations evenfor a single gene. For example, varying even ten positions in a proteinwith every possible amino acid would generate 20¹⁰ variants, which ismore than can be accommodated by existing methods of transfection andscreening.
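By way of illustration only, the size of the sequence space referred to above can be checked with a short calculation. The following Python sketch is not part of the described methods; the function name and the screening capacity of 10⁹ clones are hypothetical values chosen solely to show the scale of the mismatch.

```python
# Illustrative sketch: size of the variant space when several positions in a
# protein are each allowed to take any of the 20 standard amino acids.

AMINO_ACIDS = 20  # the 20 standard amino acids

# Hypothetical screening capacity, assumed here only for illustration;
# it is not a figure taken from the specification.
ASSUMED_SCREENING_CAPACITY = 10 ** 9


def variant_count(positions: int, alphabet: int = AMINO_ACIDS) -> int:
    """Number of distinct sequences when `positions` sites vary freely."""
    if positions < 0:
        raise ValueError("positions must be non-negative")
    return alphabet ** positions


if __name__ == "__main__":
    total = variant_count(10)
    print(f"variants for ten positions: {total:,}")
    print(f"fraction covered by the assumed capacity: "
          f"{ASSUMED_SCREENING_CAPACITY / total:.2e}")
```

Under these assumptions, ten fully varied positions give roughly 10¹³ variants, so even a very large library samples only a small fraction of the possibilities for a single gene.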

In view of these limitations, the traditional approach is inadequate forimproving cellular genomes in many useful properties. For example, toimprove a cell's capacity to express a recombinant protein might requiremodification in any or all of a substantial number of genes, known andunknown, having roles in transcription, translation, posttranslationalmodification, secretion or proteolytic degradation, among others.Attempting individually to optimize even all the known genes having suchfunctions would be a virtually impossible task, let alone optimizinghitherto unknown genes which may contribute to expression in manners notyet understood.

The present invention provides inter alia novel methods for evolving the genome of whole cells and organisms which overcome the difficulties and limitations of prior methods.

This ability to evolve genes artificially is of fundamental importance.For example, cells have a number of well-established uses in molecularbiology, medicine and industrial processes. For example, cells arecommonly used as hosts for manipulating DNA in processes such astransformation and recombination. Cells are used for expression ofrecombinant proteins encoded by DNA transformed/transfected or otherwiseintroduced into the cells. Some types of cells are used as progenitorsfor generation of transgenic animals and plants. The genomes of thecells used in these processes had evolved little from the genomes ofnatural cells, and particularly not toward acquisition of new orimproved properties for use in the above processes.

Additional methods of recursively recombining nucleic acids in vivo andselecting resulting recombinants would be of use. The present inventionprovides a number of new and valuable methods and compositions for wholeand partial genome evolution.

Metabolic engineering is the manipulation of intermediary metabolismthrough the use of both classical genetics and genetic engineeringtechniques. Cellular engineering is generally a more inclusive termreferring to the modification of cellular properties. Cameron et al.(Applied Biochem. Biotech. 38: 105-140 (1993)) provide a summary ofequivalent terms to describe this type of engineering, including“metabolic engineering”, which is most often used in the context ofindustrial microbiology and bioprocess engineering, “in vitro evolution”or “directed evolution”, most often used in the context of environmentalmicrobiology, “molecular breeding”, most often used by Japaneseresearchers, “cellular engineering”, which is used to describemodifications of bacteria, animal, and plant cells, “rational straindevelopment”, and “metabolic pathway evolution”. In this application,the terms “metabolic engineering” and “cellular engineering” are usedpreferentially for clarity; the term “evolved” genes is used asdiscussed below.

Metabolic engineering can be divided into two basic categories:modification of genes endogenous to the host organism to altermetabolite flux and introduction of foreign genes into an organism. Suchintroduction can create new metabolic pathways leading to modified cellproperties including but not limited to synthesis of known compounds notnormally made by the host cell, production of novel compounds (e.g.polymers, antibiotics, etc.) and the ability to utilize new nutrientsources. Specific applications of metabolic engineering can include theproduction of specialty and novel chemicals, including antibiotics,extension of the range of substrates used for growth and productformation, the production of new catabolic activities in an organism fortoxic chemical degradation, and modification of cell properties such asresistance to salt and other environmental factors.

Bailey (Science 252: 1668-1674 (1991)) describes the application of metabolic engineering to the recruitment of heterologous genes for the improvement of a strain, with the caveat that such introduction can result in new compounds that may subsequently undergo further reactions, or that expression of a heterologous protein can result in proteolysis, improper folding, improper modification, or unsuitable intracellular location of the protein, or lack of access to required substrates. Bailey recommends careful configuration of a desired genetic change with minimal perturbation of the host. Liao (Curr. Opin. Biotech. 4: 211-216 (1993)) reviews mathematical modeling and analysis of metabolic pathways, pointing out that in many cases the kinetic parameters of enzymes are unavailable or inaccurate. Stephanopoulos et al. (Trends. Biotechnol. 11: 392-396 (1993)) describe attempts to improve productivity of cellular systems or effect radical alteration of the flux through primary metabolic pathways as having difficulty in that control architectures at key branch points have evolved to resist flux changes. They conclude that identification and characterization of these metabolic nodes is a prerequisite to rational metabolic engineering. Similarly, Stephanopoulos (Curr. Opin. Biotech. 5: 196-200 (1994)) concludes that rather than modifying the “rate limiting step” in metabolic engineering, it is necessary to systematically elucidate the control architecture of bioreaction networks.

The present invention is generally directed to the evolution of new metabolic pathways and the enhancement of bioprocessing through a process herein termed recursive sequence recombination. Recursive sequence recombination entails performing iterative cycles of recombination and screening or selection to “evolve” individual genes, whole plasmids or viruses, multigene clusters, or even whole genomes (Stemmer, Bio/Technology 13: 549-553 (1995)). Such techniques do not require the extensive analysis and computation required by conventional methods for metabolic engineering. Recursive sequence recombination allows the recombination of large numbers of mutations in a minimum number of selection cycles, in contrast to traditional, pairwise recombination events.

Thus, because metabolic and cellular engineering can pose the particularproblem of the interaction of many gene products and regulatorymechanisms, recursive sequence recombination (RSR) techniques provideparticular advantages in that they provide recombination betweenmutations in any or all of these, thereby providing a very fast way ofexploring the manner in which different combinations of mutations canaffect a desired result, whether that result is increased yield of ametabolite, altered catalytic activity or substrate specificity of anenzyme or an entire metabolic pathway, or altered response of a cell toits environment.

4.1.2 The Evolutionary Importance of Recombination

Strain improvement is the directed evolution of an organism to be more“fit” for a desired task. In nature, adaptation is facilitated by sexualrecombination. Sexual recombination allows a population to exploit thegenetic diversity within it, e.g., by consolidating useful mutations anddiscarding deleterious ones. In this way, adaptation and evolution canproceed in leaps. In the absence of a sexual cycle, members of apopulation must evolve independently by accumulating random mutationssequentially. Many useful mutations are lost while deleterious mutationscan accumulate. Adaptation and evolution in this way proceeds slowly ascompared to sexual evolution.

Asexual Evolution is a Slow and Inefficient Process.

Populations move as individuals rather than as groups. A diversepopulation is generated by the mutagenesis of a single parent resultingin a distribution of fit and unfit individuals. In the absence of asexual cycle, each piece of genetic information of the survivingpopulation remains in the individual mutants. Selection of the “fittest”results in many “fit” individuals being discarded along with the usefulgenetic information they carry. Asexual evolution proceeds one geneticevent at a time and is thus limited by the intrinsic value of a singlegenetic event. Sexual evolution moves more quickly and efficiently.Mating within a population consolidates genetic information within thepopulation and results in useful mutations being combined together. Thecombining of useful genetic information results in progeny that are muchmore fit than their parents. Sexual evolution thus proceeds much fasterby multiple genetic events.

Years of plant and animal breeding have demonstrated the power of employing sexual recombination to effect the rapid evolution of complex genomes towards a particular task. This general principle is further demonstrated by using DNA stochastic &/or non-stochastic mutagenesis to recombine DNA molecules in vitro to accelerate the rate of directed molecular evolution. The strain improvement efforts of the fermentation industry rely on the directed evolution of microorganisms by sequential random mutagenesis. Incorporation of recombination into this iterative process greatly accelerates the strain improvement process, which in turn increases the profitability of current fermentation processes and facilitates the development of new products.

4.1.2.1 DNA Stochastic &/or Non-Stochastic Mutagenesis vs NaturalRecombination

DNA stochastic &/or non-stochastic mutagenesis includes the recursiverecombination of DNA sequences. A significant difference between DNAstochastic &/or non-stochastic mutagenesis and natural sexualrecombination is that DNA stochastic &/or non-stochastic mutagenesis canproduce DNA sequences originating from multiple parental sequences whilesexual recombination produces DNA sequences originating from only twoparental sequences.

The rate of evolution is in part limited by the number of usefulmutations that a member of a population can accumulate between selectionevents.

In sequential random mutagenesis, useful mutations are accumulated oneper selection event. Many useful mutations are discarded each cycle infavor of the best performer, and neutral or deleterious mutations whichsurvive are as difficult to lose as they were to gain and thusaccumulate. In sexual evolution pairwise recombination allows mutationsfrom two different parents to segregate and recombine in differentcombinations. Useful mutations can accumulate and deleterious mutationscan be lost. Poolwise recombination, such as that effected by DNAstochastic &/or non-stochastic mutagenesis, has the same advantages aspairwise recombination but allows mutations from many parents toconsolidate into a single progeny. Thus poolwise recombination providesa means for increasing the number of useful mutations that canaccumulate each selection event. One can plot the potential number ofmutations an individual can accumulate by each of these processes.Recombination is exponentially superior to sequential randommutagenesis, and this advantage increases exponentially with the numberof parents that can recombine. Sexual recombination is thus moreconservative. In nature, the pairwise nature of sexual recombination mayprovide important stability within a population by impeding the largechanges in DNA sequence that can result from poolwise recombination. Forthe purposes of directed evolution, however, poolwise recombination ismore efficient.
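The comparison described above can be tabulated under a deliberately simplified model. The sketch below is an illustration only and rests on assumptions not stated in the text: every individual starts with one useful mutation, sequential random mutagenesis adds exactly one useful mutation per selection event, and each recombination event can at best merge the useful mutations of all of its parents (two for pairwise, k for poolwise). The function name is hypothetical.

```python
# Illustrative upper-bound model of how many useful mutations one individual
# can carry after a number of selection events. All modelling choices here
# are simplifying assumptions made for illustration.


def max_useful_mutations(cycles: int, parents_per_event: int) -> int:
    """Best-case count of useful mutations in one individual.

    parents_per_event == 1 models sequential random mutagenesis
    (one new useful mutation per cycle); 2 models pairwise (sexual)
    recombination; larger values model poolwise recombination.
    """
    if cycles < 0 or parents_per_event < 1:
        raise ValueError("cycles must be >= 0 and parents_per_event >= 1")
    if parents_per_event == 1:
        return 1 + cycles  # linear accumulation
    return parents_per_event ** cycles  # geometric accumulation


if __name__ == "__main__":
    print("cycle  sequential  pairwise  poolwise(10 parents)")
    for cycle in range(5):
        print(f"{cycle:5d}  "
              f"{max_useful_mutations(cycle, 1):10d}  "
              f"{max_useful_mutations(cycle, 2):8d}  "
              f"{max_useful_mutations(cycle, 10):20d}")
```

In practice the recombination figures are bounded by the number of distinct useful mutations actually present in the population; the model only shows why accumulation is linear in one case and geometric in the others.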

The potential diversity that can be generated from a population is greater as a result of poolwise recombination as compared to that resulting from pairwise recombination. Further, poolwise recombination enables the combining of multiple beneficial mutations originating from multiple parental sequences. To demonstrate the importance of poolwise recombination vs pairwise recombination in the generation of molecular diversity consider the breeding of ten independent DNA sequences each containing only one unique mutation. There are 2¹⁰ = 1024 different combinations of those ten mutations ranging from a single sequence having no mutations (the consensus) to that having all ten mutations. If this pool were recombined together by pairwise recombination, a population containing the consensus, the parents, and the 45 different combinations of any two of the mutations would result in 56 or ca. 5% of the possible 1024 mutant combinations. Alternatively, if the pool were recombined together in a poolwise fashion, all 1024 would be theoretically generated, resulting in an approximately 20 fold increase in library diversity. When looking for a unique solution to a problem in molecular evolution, the more complex the library, the more complex the possible solution. Indeed, the most fit member of a stochastic &/or non-stochastic mutagenized library often contains several mutations originating from several independent starting sequences.
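The counts in the ten-sequence example can be reproduced directly. The following Python sketch is illustrative only; the function names are hypothetical, and it uses nothing beyond the standard-library `math.comb`.

```python
# Illustrative check of the ten-mutation example: library sizes reachable by
# a single round of pairwise recombination versus poolwise recombination.
from math import comb


def pairwise_library_size(n_mutations: int) -> int:
    """Consensus + single mutants (parents) + every pair of mutations."""
    return comb(n_mutations, 0) + comb(n_mutations, 1) + comb(n_mutations, 2)


def poolwise_library_size(n_mutations: int) -> int:
    """Every subset of the mutations, from none to all of them."""
    return 2 ** n_mutations


if __name__ == "__main__":
    n = 10
    pairwise = pairwise_library_size(n)
    poolwise = poolwise_library_size(n)
    print(f"pairwise: {pairwise} combinations")
    print(f"poolwise: {poolwise} combinations")
    print(f"pairwise coverage: {100 * pairwise / poolwise:.1f}%")
    print(f"fold increase: {poolwise / pairwise:.1f}")
```

For ten mutations this gives 1 + 10 + 45 = 56 pairwise combinations against 1024 poolwise combinations, i.e. about 5.5% coverage and an increase of roughly 18 fold, consistent with the approximate figures given above.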

4.1.2.2 DNA Stochastic &/or Non-Stochastic Mutagenesis ProvidesRecursive Pairwise Recombination

In vitro DNA stochastic &/or non-stochastic mutagenesis results in the efficient production of combinatorial genetic libraries by catalyzing the recombination of multiple DNA sequences. While the result of DNA stochastic &/or non-stochastic mutagenesis is a population representing the poolwise recombination of multiple sequences, the process does not rely on the recombination of multiple DNA sequences simultaneously, but rather on their recursive pairwise recombination. The assembly of complete genes from a mixed pool of small gene fragments requires multiple annealing and elongation cycles, the thermal cycles of the primerless PCR reaction. During each thermal cycle many pairs of fragments anneal and are extended to form a combinatorial population of larger chimeric DNA fragments. After the first cycle of stochastic &/or non-stochastic mutagenesis, chimeric fragments contain sequence originating from predominantly two different parent genes, with all possible pairs of “parental” sequence theoretically represented. This is similar to the result of a single sexual cycle within a population. During the second cycle, these chimeric fragments anneal with each other or with other small fragments, resulting in chimeras originating from up to four of the different starting sequences, again with all possible combinations of the four parental sequences theoretically represented. This second cycle is analogous to the entire population resulting from a single sexual cross, both parents and offspring, inbreeding.

Further cycles result in chimeras originating from 8, 16, 32, etc. parental sequences and are analogous to further inbreedings of the preceding population. This could be considered similar to the diversity generated from a small population of birds that are isolated on an island, breeding with each other for many generations. The result mimics the outcome of “poolwise” recombination, but the path is via recursive pairwise recombination. For this reason, the DNA molecules generated from in vitro DNA stochastic &/or non-stochastic mutagenesis are not the “progeny” of the starting “parental” sequences, but rather the great, great, great, ..., great_n (n = number of thermal cycles) grand progeny of the starting “ancestor” molecules.
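The doubling of ancestry with each thermal cycle can be illustrated with a small idealized simulation. In the Python sketch below, which is an illustration only and not a model of the actual reaction, each molecule is represented solely by the set of starting sequences that contributed to it, and every pair of molecules present is assumed to be able to form a chimera in each cycle. The function names and the choice of eight starting sequences are hypothetical.

```python
# Idealized sketch of recursive pairwise recombination. A molecule is tracked
# only by its ancestry: the set of starting ("ancestor") sequences from which
# it derives. Fragment length, annealing efficiency and sequence homology are
# all ignored; these are simplifying assumptions for illustration.
from itertools import combinations


def recombine_once(population):
    """One annealing/extension cycle.

    Every pair of molecules may form a chimera whose ancestry is the union
    of the two parental ancestries. Molecules that do not pair persist.
    """
    next_population = set(population)
    for first, second in combinations(population, 2):
        next_population.add(first | second)
    return next_population


def max_ancestry_per_cycle(n_parents, cycles):
    """Largest number of ancestors in any one molecule after each cycle."""
    population = {frozenset([i]) for i in range(n_parents)}
    sizes = [max(len(molecule) for molecule in population)]
    for _ in range(cycles):
        population = recombine_once(population)
        sizes.append(max(len(molecule) for molecule in population))
    return sizes


if __name__ == "__main__":
    for cycle, size in enumerate(max_ancestry_per_cycle(8, 3)):
        print(f"after cycle {cycle}: up to {size} ancestor(s) per molecule")
```

Although each step joins only two molecules, the maximum ancestry grows as 1, 2, 4, 8 until it is limited by the number of starting sequences, which is the sense in which the pairwise path reproduces a poolwise outcome.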

4.1.3 Definitions

The term “cognate” refers to a gene sequence that is evolutionarily and functionally related between species. For example, in the human genome, the human CD4 gene is the cognate gene to the mouse CD4 gene, since the sequences and structures of these two genes indicate that they are highly homologous and both genes encode a protein which functions in signaling T-cell activation through MHC class II-restricted antigen recognition.

Screening is, in general, a two-step process in which one first determines which cells do and do not express a screening marker or phenotype (or a selected level of marker or phenotype), and then physically separates the cells having the desired property. Selection is a form of screening in which identification and physical separation are achieved simultaneously by expression of a selection marker, which, in some genetic circumstances, allows cells expressing the marker to survive while other cells die (or vice versa). Screening markers include luciferase, beta-galactosidase, and green fluorescent protein. Selection markers include drug and toxin resistance genes.

An exogenous DNA segment is one foreign (or heterologous) to the cell orhomologous to the cell but in a position within the host cell nucleicacid in which the element is not ordinarily found. Exogenous DNAsegments can be expressed to yield exogenous polypeptides.

The term “gene” is used broadly to refer to any segment of DNAassociated with a biological function. Thus, genes include codingsequences and/or the regulatory sequences required for their expression.Genes also include nonexpressed DNA segments that, for example, formrecognition sequences for other proteins.

The terms “identical” or “percent identity,” in the context of two ormore nucleic acids or polypeptide sequences, refer to two or moresequences or subsequences that are the same or have a specifiedpercentage of amino acid residues or nucleotides that are the same, whencompared and aligned for maximum correspondence, as measured using oneof the following sequence comparison algorithms or by visual inspection.

The phrase “substantially identical,” in the context of two nucleic acids or polypeptides, refers to two or more sequences or subsequences that have at least 60%, preferably 80%, most preferably 90-95% nucleotide or amino acid residue identity, when compared and aligned for maximum correspondence, as measured using one of the following sequence comparison algorithms or by visual inspection. Preferably, the substantial identity exists over a region of the sequences that is at least about 50 residues in length, more preferably over a region of at least about 100 residues, and most preferably the sequences are substantially identical over at least about 150 residues. In a most preferred embodiment, the sequences are substantially identical over the entire length of the coding regions.

For sequence comparison, typically one sequence acts as a reference sequence, to which test sequences are compared. When using a sequence comparison algorithm, test and reference sequences are input into a computer, subsequence coordinates are designated, if necessary, and sequence algorithm program parameters are designated. The sequence comparison algorithm then calculates the percent sequence identity for the test sequence(s) relative to the reference sequence, based on the designated program parameters.
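The percent identity calculation described above can be sketched in a few lines of Python. This is a minimal illustration only: it assumes the two sequences have already been aligned (gaps written as `-`), it counts every aligned column in the denominator, and the function names and the default thresholds (60% identity over at least 50 columns, taken from the least stringent values in the definition above) are illustrative assumptions rather than part of any named program.

```python
def percent_identity(aligned_a: str, aligned_b: str, gap: str = "-") -> float:
    """Percent of aligned columns in which both sequences carry the same residue.

    Assumption: columns where both sequences have a gap are ignored; a column
    with a gap in only one sequence counts as a non-identical column.
    """
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    compared = 0
    matches = 0
    for x, y in zip(aligned_a, aligned_b):
        if x == gap and y == gap:
            continue  # nothing to compare in this column
        compared += 1
        if x == y:
            matches += 1
    return 100.0 * matches / compared if compared else 0.0


def is_substantially_identical(aligned_a: str, aligned_b: str,
                               min_identity: float = 60.0,
                               min_region: int = 50) -> bool:
    """Apply an identity threshold over a minimum region length (both hypothetical defaults)."""
    if len(aligned_a) < min_region:
        return False
    return percent_identity(aligned_a, aligned_b) >= min_identity


if __name__ == "__main__":
    reference = "ACGT" * 15                 # 60 aligned positions
    test = "ACGT" * 12 + "TTTT" * 3         # differs in 9 of the last 12 positions
    print(percent_identity(reference, test))
    print(is_substantially_identical(reference, test))
```

Other conventions for the denominator exist (for example, the length of the shorter sequence), so reported percentages from different programs are not always directly comparable.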

Optimal alignment of sequences for comparison can be conducted, e.g., by the local homology algorithm of Smith & Waterman, Adv. Appl. Math. 2: 482 (1981), by the homology alignment algorithm of Needleman & Wunsch, J. Mol. Biol. 48: 443 (1970), by the search for similarity method of Pearson & Lipman, Proc. Natl. Acad. Sci. USA 85: 2444 (1988), or by computerized implementations of these algorithms (GAP, BESTFIT, FASTA, and TFASTA in the Wisconsin Genetics Software Package Release 7.0, Genetics Computer Group, 575 Science Dr., Madison, WI).
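For orientation, a global alignment in the style of Needleman & Wunsch can be sketched with a small dynamic-programming table. The sketch below is an assumption-laden illustration: the scoring values (match +1, mismatch -1, linear gap penalty -1) and the function name are chosen here for simplicity and are not the parameters used by GAP, BESTFIT, or any other packaged implementation.

```python
def needleman_wunsch(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1):
    """Return (score, aligned_a, aligned_b) for a global alignment of a and b."""
    rows, cols = len(a) + 1, len(b) + 1
    score = [[0] * cols for _ in range(rows)]
    # Aligning a prefix against nothing costs one gap per residue.
    for i in range(1, rows):
        score[i][0] = i * gap
    for j in range(1, cols):
        score[0][j] = j * gap
    # Fill the table: each cell is the best of diagonal (pair), up (gap in b), left (gap in a).
    for i in range(1, rows):
        for j in range(1, cols):
            pair = match if a[i - 1] == b[j - 1] else mismatch
            score[i][j] = max(score[i - 1][j - 1] + pair,
                              score[i - 1][j] + gap,
                              score[i][j - 1] + gap)
    # Trace back from the bottom-right corner to recover one optimal alignment.
    i, j = len(a), len(b)
    out_a, out_b = [], []
    while i > 0 or j > 0:
        pair = match if (i > 0 and j > 0 and a[i - 1] == b[j - 1]) else mismatch
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + pair:
            out_a.append(a[i - 1]); out_b.append(b[j - 1]); i -= 1; j -= 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1]); out_b.append("-"); i -= 1
        else:
            out_a.append("-"); out_b.append(b[j - 1]); j -= 1
    return score[-1][-1], "".join(reversed(out_a)), "".join(reversed(out_b))


if __name__ == "__main__":
    print(needleman_wunsch("ACGT", "AGT"))
```

A local (Smith & Waterman style) variant differs mainly in clamping each cell at zero and starting the traceback from the highest-scoring cell rather than the corner.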

Another example of a useful alignment algorithm is PILEUP. PILEUP creates a multiple sequence alignment from a group of related sequences using progressive, pairwise alignments to show relationship and percent sequence identity. It also plots a tree or dendrogram showing the clustering relationships used to create the alignment. PILEUP uses a simplification of the progressive alignment method of Feng & Doolittle, J. Mol. Evol. 25: 351-360 (1987). The method used is similar to the method described by Higgins & Sharp, CABIOS 5: 151-153 (1989). The program can align up to 300 sequences, each of a maximum length of 5,000 nucleotides or amino acids. The multiple alignment procedure begins with the pairwise alignment of the two most similar sequences, producing a cluster of two aligned sequences. This cluster is then aligned to the next most related sequence or cluster of aligned sequences. Two clusters of sequences are aligned by a simple extension of the pairwise alignment of two individual sequences. The final alignment is achieved by a series of progressive, pairwise alignments. The program is run by designating specific sequences and their amino acid or nucleotide coordinates for regions of sequence comparison and by designating the program parameters. For example, a reference sequence can be compared to other test sequences to determine the percent sequence identity relationship using the following parameters: default gap weight (3.00), default gap length weight (0.10), and weighted end gaps.

Another example of an algorithm that is suitable for determining percent sequence identity and sequence similarity is the BLAST algorithm, which is described in Altschul et al., J. Mol. Biol. 215: 403-410 (1990). Software for performing BLAST analyses is publicly available through the National Center for Biotechnology Information (http://www.ncbi.nlm.nih.gov/). This algorithm involves first identifying high scoring sequence pairs (HSPs) by identifying short words of length W in the query sequence, which either match or satisfy some positive-valued threshold score T when aligned with a word of the same length in a database sequence. T is referred to as the neighborhood word score threshold (Altschul et al., supra). These initial neighborhood word hits act as seeds for initiating searches to find longer HSPs containing them. The word hits are then extended in both directions along each sequence for as far as the cumulative alignment score can be increased. Cumulative scores are calculated using, for nucleotide sequences, the parameters M (reward score for a pair of matching residues; always >0) and N (penalty score for mismatching residues; always <0). For amino acid sequences, a scoring matrix is used to calculate the cumulative score. Extension of the word hits in each direction is halted when: the cumulative alignment score falls off by the quantity X from its maximum achieved value; the cumulative score goes to zero or below, due to the accumulation of one or more negative-scoring residue alignments; or the end of either sequence is reached. The BLAST algorithm parameters W, T, and X determine the sensitivity and speed of the alignment. The BLASTN program (for nucleotide sequences) uses as defaults a word length (W) of 11, an expectation (E) of 10, M=5, N=-4, and a comparison of both strands. For amino acid sequences, the BLASTP program uses as defaults a word length (W) of 3, an expectation (E) of 10, and the BLOSUM62 scoring matrix (see Henikoff & Henikoff, Proc. Natl. Acad. Sci. USA 89: 10915 (1992)).
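The three halting conditions for word-hit extension can be illustrated with a short ungapped extension routine. This is a simplified sketch, not the BLAST implementation: it extends a seed only to the right, uses the M=5 and N=-4 nucleotide scores mentioned above, and treats the function name and the default drop-off value X=20 as assumptions made for the example.

```python
def extend_right(query: str, subject: str, q_start: int, s_start: int,
                 word_len: int, match: int = 5, mismatch: int = -4,
                 x_drop: int = 20):
    """Extend a word hit to the right without gaps.

    Returns (best_score, best_length), where best_length counts residues from
    the start of the seed to the position at which the maximum score was reached.
    """
    # Score the seed word itself.
    score = sum(match if query[q_start + k] == subject[s_start + k] else mismatch
                for k in range(word_len))
    best_score, best_len = score, word_len
    q, s, length = q_start + word_len, s_start + word_len, word_len
    # Halting condition 3: stop when the end of either sequence is reached.
    while q < len(query) and s < len(subject):
        score += match if query[q] == subject[s] else mismatch
        length += 1
        if score > best_score:
            best_score, best_len = score, length
        # Halting condition 1: score has fallen off by X from its maximum.
        # Halting condition 2: cumulative score has gone to zero or below.
        if best_score - score >= x_drop or score <= 0:
            break
        q += 1
        s += 1
    return best_score, best_len


if __name__ == "__main__":
    query = "ACGTACGTACGTAAAA"
    subject = "ACGTACGTACGTCCCC"
    print(extend_right(query, subject, 0, 0, word_len=11))
```

A full implementation would run the same loop leftward from the seed and combine both extensions into one HSP.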

In addition to calculating percent sequence identity, the BLAST algorithm also performs a statistical analysis of the similarity between two sequences (see, e.g., Karlin & Altschul, Proc. Natl. Acad. Sci. USA 90: 5873-5877 (1993)). One measure of similarity provided by the BLAST algorithm is the smallest sum probability (P(N)), which provides an indication of the probability by which a match between two nucleotide or amino acid sequences would occur by chance. For example, a nucleic acid is considered similar to a reference sequence if the smallest sum probability in a comparison of the test nucleic acid to the reference nucleic acid is less than about 0.1, more preferably less than about 0.01, and most preferably less than about 0.001.

A further indication that two nucleic acid sequences or polypeptides are substantially identical is that the polypeptide encoded by the first nucleic acid is immunologically cross reactive with the polypeptide encoded by the second nucleic acid, as described below. Thus, a polypeptide is typically substantially identical to a second polypeptide, for example, where the two peptides differ only by conservative substitutions.

Another indication that two nucleic acid sequences are substantially identical is that the two molecules hybridize to each other under stringent conditions.

The term “naturally-occurring” is used to describe an object that can be found in nature. For example, a polypeptide or polynucleotide sequence that is present in an organism (including viruses) that can be isolated from a source in nature and which has not been intentionally modified by man in the laboratory is naturally-occurring. Generally, the term naturally-occurring refers to an object as present in a non-pathological (undiseased) individual, such as would be typical for the species.

Asexual recombination is recombination occurring without the fusion of gametes to form a zygote.

A “mismatch repair deficient strain” can include any mutants in any organism impaired in the functions of mismatch repair. These include mutant gene products of mutS, mutT, mutH, mutL, uvrD, dcm, vsr, umuC, umuD, sbcB, recJ, etc. The impairment is achieved by genetic mutation, allelic replacement, selective inhibition by an added reagent such as a small compound or an expressed antisense RNA, or other techniques. Impairment can be of the genes noted, or of homologous genes in any organism.

4.2. Strategies

4.2.1. Evolving a Cell to Acquire a Desired Function

4.2.1.1. Desired Function is Secretion of a Protein

Optionally, the desired function is secretion of a protein, and the plurality of cells further comprises a construct encoding the protein. The protein is optionally inactive unless secreted, and further modified cells are optionally selected for protein function.

Optionally, the protein is toxic to the plurality of cells, unless secreted. In this case, the modified or further modified cells which evolve toward acquisition of the desired function are screened by propagating the cells and recovering surviving cells.

4.2.1.2. Desired Function is Enhanced Recombination

In some methods, the desired function is enhanced recombination. In such methods, the library of fragments sometimes comprises a cluster of genes collectively conferring recombination capacity. Screening can be achieved using cells carrying a gene encoding a marker whose expression is prevented by a mutation removable by recombination. The cells are screened by their expression of the marker resulting from removal of the mutation by recombination.

4.2.1.3. Desired Function is Improved Resistance in Plant Cells

In some methods, the plurality of cells are plant cells and the desired property is improved resistance to a chemical or microbe. The modified or further modified cells (or whole plants) are exposed to the chemical or microbe and modified or further modified cells having evolved toward the acquisition of the desired function are selected by their capacity to survive the exposure.

4.2.1.4. Desired Function is Predicting Efficacy of a Drug

4.2.1.4.1. A Drug Treating a Viral Infection

The invention further provides methods of predicting efficacy of a drug in treating a viral infection. Such methods entail recombining a nucleic acid segment from a virus, whose infection is inhibited by a drug, with at least a second nucleic acid segment from the virus, the second nucleic acid segment differing from the first nucleic acid segment in at least two nucleotides, to produce a library of recombinant nucleic acid segments. Host cells are then contacted with a collection of viruses having genomes including the recombinant nucleic acid segments in media containing the drug, and progeny viruses resulting from infection of the host cells are collected.

A recombinant DNA segment from a first progeny virus is recombined with at least a recombinant DNA segment from a second progeny virus to produce a further library of recombinant nucleic acid segments. Host cells are contacted with a collection of viruses having genomes including the further library of recombinant nucleic acid segments, in media containing the drug, and further progeny viruses are produced by the host cells. The recombination and selection steps are repeated, as desired, until a further progeny virus has acquired a desired degree of resistance to the drug, whereby the degree of resistance acquired and the number of repetitions needed to acquire it provide a measure of the efficacy of the drug in treating the virus. Viruses are optionally adapted to grow on particular cell lines.

4.2.1.4.2. A Drug Treating Infection by a Pathogenic Microorganism

The invention further provides methods of predicting efficacy of a drug in treating an infection by a pathogenic microorganism. These methods entail delivering a library of DNA fragments into a plurality of microorganism cells, at least some of which undergo recombination with segments in the genome of the cells to produce modified microorganism cells. Modified microorganisms are propagated in media containing the drug, and surviving microorganisms are recovered. DNA from surviving microorganisms is recombined with a further library of DNA fragments, at least some of which undergo recombination with cognate segments in the DNA from the surviving microorganisms to produce further modified microorganism cells. Further modified microorganisms are propagated in media containing the drug, and further surviving microorganisms are collected. The recombination and selection steps are repeated as needed, until a further surviving microorganism has acquired a desired degree of resistance to the drug. The degree of resistance acquired and the number of repetitions needed to acquire it provide a measure of the efficacy of the drug in killing the pathogenic microorganism.

4.2.1.3. Method

4.2.1.3.1 Modify or Recombine Cells

In one aspect, the invention provides methods of evolving a cell to acquire a desired function. Such methods entail, e.g., introducing a library of DNA fragments into a plurality of cells, whereby at least one of the fragments undergoes recombination with a segment in the genome or an episome of the cells to produce modified cells. Optionally, these modified cells are bred to increase the diversity of the resulting recombined cellular population. The modified cells, or the recombined cellular population, are then screened for modified or recombined cells that have evolved toward acquisition of the desired function. DNA from the modified cells that have evolved toward the desired function is then optionally recombined with a further library of DNA fragments, at least one of which undergoes recombination with a segment in the genome or the episome of the modified cells to produce further modified cells. The further modified cells are then screened for further modified cells that have further evolved toward acquisition of the desired function. Steps of recombination and screening/selection are repeated as required until the further modified cells have acquired the desired function. In one preferred embodiment, modified cells are recursively recombined to increase diversity of the cells prior to performing any selection steps on any resulting cells.
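The recursive cycle of recombination and screening described above can be pictured with a toy simulation. Everything in the sketch below is an assumption made for illustration: genomes are modeled as strings of `0` and `1`, the "desired function" is simply the count of `1` characters, the fragment library is cut from the current population at fixed positions, and the parents are kept in the screened pool. None of these choices is prescribed by the method; the sketch only shows the loop structure (recombine, screen, repeat until the target is met).

```python
from typing import Callable, List, Tuple

Fragment = Tuple[int, str]  # (start position in the genome, fragment sequence)


def make_library(genomes: List[str], length: int = 4) -> List[Fragment]:
    """Cut every genome into all overlapping fragments of the given length."""
    return [(start, g[start:start + length])
            for g in genomes
            for start in range(len(g) - length + 1)]


def recombine(genome: str, fragment: Fragment) -> str:
    """Replace the homologous segment of the genome with the fragment."""
    start, seq = fragment
    return genome[:start] + seq + genome[start + len(seq):]


def evolve(population: List[str], fitness: Callable[[str], int],
           target: int, keep: int = 4, max_cycles: int = 10):
    """Repeat recombination and screening until the target fitness is reached."""
    for cycle in range(1, max_cycles + 1):
        library = make_library(population)
        modified = [recombine(g, f) for g in population for f in library]
        # Screening step: rank parents and modified cells, keep the best few.
        pool = population + modified
        pool.sort(key=fitness, reverse=True)
        population = pool[:keep]
        if fitness(population[0]) >= target:
            return population, cycle
    return population, max_cycles


if __name__ == "__main__":
    # Four hypothetical parents, each carrying one useful block of mutations.
    parents = ["1111000000000000", "0000111100000000",
               "0000000011110000", "0000000000001111"]
    final, cycles = evolve(parents, fitness=lambda g: g.count("1"), target=16)
    print(final[0], cycles)
```

Because each recombination event here transfers at most one four-position block, a parent carrying one block needs at least three cycles to accumulate all four, which mirrors the point that useful mutations from different lineages are combined over successive rounds.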

4.2.1.3.2 Coat with RecA

In some methods, the library or further library of DNA fragments is coated with recA protein to stimulate recombination with the segment of the genome. The library of fragments is optionally denatured to produce single-stranded DNA, which is annealed to produce duplexes, some of which contain mismatches at points of variation in the fragments. Duplexes containing mismatches are optionally selected by affinity chromatography to immobilized MutS.

4.2.1.3.3 Perform In Vivo Recombination

The invention further provides methods for performing in vivo recombination. At least first and second segments from at least one gene are introduced into a cell, the segments differing from each other in at least two nucleotides, whereby the segments recombine to produce a library of chimeric genes. A chimeric gene is selected from the library having acquired a desired function.

The invention further provides methods of evolving a cell to acquire a desired function. These methods entail providing a population of different cells. The cells are cultured under conditions whereby DNA is exchanged between cells, forming cells with hybrid genomes. The cells are then screened or selected for cells that have evolved toward acquisition of a desired property. The DNA exchange and screening/selecting steps are repeated, as needed, with the screened/selected cells from one cycle forming the population of different cells in the next cycle, until a cell has acquired the desired property.

Mechanisms of DNA exchange include conjugation, phage-mediated transduction, liposome delivery, protoplast fusion, and sexual recombination of the cells. Optionally, a library of DNA fragments can be transformed or electroporated into the cells.

4.2.1.3.4 Protoplast-Mediated Exchange

As noted, some methods of evolving a cell to acquire a desired property are effected by protoplast-mediated exchange of DNA between cells. Such methods entail forming protoplasts of a population of different cells. The protoplasts are then fused to form hybrid protoplasts, in which genomes from the protoplasts recombine to form hybrid genomes. The hybrid protoplasts are incubated under conditions promoting regeneration of cells. The regenerated cells can be recombined one or more times (i.e., via protoplasting or any other method that combines genomes of cells) to increase the diversity of any resulting cells. Preferably, regenerated cells are recombined several times, e.g., by protoplast fusion, to generate a diverse population of cells. The next step is to select or screen to isolate regenerated cells that have evolved toward acquisition of the desired property. DNA exchange and selection/screening steps are repeated, as needed, with regenerated cells in one cycle being used to form protoplasts in the next cycle until the regenerated cells have acquired the desired property. Industrial microorganisms are a preferred class of organisms for conducting the above methods. Some methods further comprise a step of selecting or screening for fused protoplasts free from unfused protoplasts of parental cells. Some methods further comprise a step of selecting or screening for fused protoplasts with hybrid genomes free from cells with parental genomes. In some methods, protoplasts are provided by treating individual cells, mycelia or spores with an enzyme that degrades cell walls. In some methods, the strain is a mutant that is lacking capacity for intact cell wall synthesis, and protoplasts form spontaneously. In some methods, protoplasts are formed by treating growing cells with an inhibitor of cell wall formation to generate protoplasts.

In some methods, the desired property is expression and/or secretion of a protein or secondary metabolite, such as an industrial enzyme, a therapeutic protein, a primary metabolite such as lactic acid or ethanol, or a secondary metabolite such as erythromycin, cyclosporin A or taxol. In other methods it is the ability of the cell to convert compounds provided to the cell to different compounds. In yet other methods, the desired property is capacity for meiosis. In some methods, the desired property is compatibility to form a heterokaryon with another strain.

The invention further provides methods of evolving a cell toward acquisition of a desired property. These methods entail providing a population of different cells. DNA is isolated from a first subpopulation of the different cells and encapsulated in liposomes. Protoplasts are formed from a second subpopulation of the different cells. Liposomes are fused with the protoplasts, whereby DNA from the liposomes is taken up by the protoplasts and recombines with the genomes of the protoplasts. The protoplasts are incubated under regenerating conditions. Regenerating or regenerated cells are then selected or screened for evolution toward the desired property.

4.2.1.3.4 Reiterative Pooling and Breeding of higher Organisms

The invention also provides methods of reiterative pooling and breeding of higher organisms. In the methods, a library of diverse multicellular organisms is produced (e.g., plants, animals or the like). A pool of male gametes is provided along with a pool of female gametes. At least one of the male pool or the female pool comprises a plurality of different gametes derived from different strains of a species or different species. The male gametes are used to fertilize the female gametes. At least a portion of the resulting fertilized gametes grow into reproductively viable organisms. These reproductively viable organisms are crossed (e.g., by pairwise pooling and joining of the male and female gametes as before) to produce a library of diverse organisms. The library is then selected for a desired trait or property.

The library of diverse organisms can comprise a plurality of plants such as Gramineae, Festucoideae, Poacoideae, Agrostis, Phleum, Dactylis, Sorghum, Setaria, Zea, Oryza, Triticum, Secale, Avena, Hordeum, Saccharum, Poa, Festuca, Stenotaphrum, Cynodon, Coix, Olyreae, Phareae, Compositae or Leguminosae. For example, the plants can be corn, rice, wheat, rye, oats, barley, pea, beans, lentil, peanut, yam bean, cowpeas, velvet beans, soybean, clover, alfalfa, lupine, vetch, lotus, sweet clover, wisteria, sweet pea, sorghum, millet, sunflower, canola or the like.

Similarly, the library of diverse organisms can include a plurality of animals such as non-human mammals, fish, insects, or the like.

Optionally, a plurality of selected library members can be crossed by pooling gametes from the selected members and repeatedly crossing any resulting additional reproductively viable organisms to produce a second library of diverse organisms (e.g., by split pairwise pooling and rejoining of the male and female gametes). Here again, the second library can be selected for a desired trait or property, with the resulting selected members forming the basis for additional poolwise breeding and selection. A feature of the invention is the libraries made by these (or any preceding) methods.

4.3. Origin of Cells

4.3.1 Embryonic Cells of an Animal

In some methods, the plurality of cells are embryonic cells of an animal, and the method further comprises propagating the transformed cells to transgenic animals. The plurality of cells can be a plurality of industrial microorganisms that are enriched for microorganisms which are tolerant to desired process conditions (heat, light, radiation, selected pH, presence of detergents or other denaturants, presence of alcohols or other organic molecules, etc.).

4.3.2 Artificial Chromosomes

The invention further provides methods of evolving a cell toward acquisition of a desired property using artificial chromosomes. Such methods entail introducing a DNA fragment library cloned into an artificial chromosome into a population of cells. The cells are then cultured under conditions whereby sexual recombination occurs between the cells, and DNA fragments cloned into the artificial chromosome recombine by homologous recombination with corresponding segments of endogenous chromosomes of the populations of cells, and endogenous chromosomes recombine with each other. Cells can also be recombined via conjugation. Any resulting cells can be recombined via any method noted herein, as many times as desired, to generate a desired level of diversity in the resulting recombinant cells. In any case, after generating a diverse library of cells, the cells that have evolved toward acquisition of the desired property are screened and/or selected for a desired property. The method is then repeated with cells that have evolved toward the desired property in one cycle forming the population of different cells in the next cycle. Here again, multiple cycles of in vivo recombination are optionally performed prior to any additional selection or screening steps.

The invention further provides methods of evolving a DNA segment cloned into an artificial chromosome for acquisition of a desired property. These methods entail providing a library of variants of the segment, each variant cloned into separate copies of an artificial chromosome. The copies of the artificial chromosome are introduced into a population of cells. The cells are cultured under conditions whereby sexual recombination occurs between cells and homologous recombination occurs between copies of the artificial chromosome bearing the variants. Variants are then screened or selected for evolution toward acquisition of the desired property. The invention further provides hyperrecombinogenic recA proteins.

4.4. Method to Acquire a Biocatalytic Activity

One aspect of the invention is a method of evolving a biocatalytic activity of a cell, comprising:

-   (a) recombining at least a first and second DNA segment from at least one gene conferring ability to catalyze a reaction of interest, the segments differing from each other in at least two nucleotides, to produce a library of recombinant genes;
-   (b) screening at least one recombinant gene from the library that confers enhanced ability to catalyze the reaction of interest by the cell relative to a wild type form of the gene;
-   (c) recombining at least a segment from at least one recombinant gene with a further DNA segment from at least one gene, the same or different from the first and second segments, to produce a further library of recombinant genes;
-   (d) screening at least one further recombinant gene from the further library of recombinant genes that confers enhanced ability to catalyze the reaction of interest in the cell relative to a previous recombinant gene;
-   (e) repeating (c) and (d), as necessary, until the further recombinant gene confers a desired level of enhanced ability to catalyze the reaction of interest by the cell.

4.4.1. Method to Evolve a Gene to Catalyze a RXN of Interest

Another aspect of the invention is a method of evolving a gene to confer ability to catalyze a reaction of interest, the method comprising:

-   (1) recombining at least first and second DNA segments from at least one gene conferring ability to catalyze a reaction of interest, the segments differing from each other in at least two nucleotides, to produce a library of recombinant genes;
-   (2) screening at least one recombinant gene from the library that confers enhanced ability to catalyze a reaction of interest relative to a wild type form of the gene;
-   (3) recombining at least a segment from the at least one recombinant gene with a further DNA segment from the at least one gene, the same or different from the first and second segments, to produce a further library of recombinant genes;
-   (4) screening at least one further recombinant gene from the further library of recombinant genes that confers enhanced ability to catalyze a reaction of interest relative to a previous recombinant gene;
-   (5) repeating (3) and (4), as necessary, until the further recombinant gene confers a desired level of enhanced ability to catalyze a reaction of interest.

4.4.2. Method to Generate a New Biocatalytic Activity in a Cell

A further aspect of the invention is a method of generating a new biocatalytic activity in a cell, comprising:

-   (1) recombining at least first and second DNA segments from at least one gene conferring ability to catalyze a first reaction related to a second reaction of interest, the segments differing from each other in at least two nucleotides, to produce a library of recombinant genes;
-   (2) screening at least one recombinant gene from the library that confers a new ability to catalyze the second reaction of interest;
-   (3) recombining at least a segment from at least one recombinant gene with a further DNA segment from the at least one gene, the same or different from the first and second segments, to produce a further library of recombinant genes;
-   (4) screening at least one further recombinant gene from the further library of recombinant genes that confers enhanced ability to catalyze the second reaction of interest in the cell relative to a previous recombinant gene;
-   (5) repeating (3) and (4), as necessary, until the further recombinant gene confers a desired level of enhanced ability to catalyze the second reaction of interest in the cell.

4.4.3. Method to Modify a Metabolic Pathway Evolved by Recursive Sequence Recombination

Another aspect of the invention is a modified form of a cell, wherein the modification comprises a metabolic pathway evolved by recursive sequence recombination.

A further aspect of the invention is a method of optimizing expression of a gene product, the method comprising:

-   (1) recombining at least first and second DNA segments from at least one gene conferring ability to produce the gene product, the segments differing from each other in at least two nucleotides, to produce a library of recombinant genes;
-   (2) screening at least one recombinant gene from the library that confers optimized expression of the gene product relative to a wild type form of the gene;
-   (3) recombining at least a segment from the at least one recombinant gene with a further DNA segment from the at least one gene, the same or different from the first and second segments, to produce a further library of recombinant genes;
-   (4) screening at least one further recombinant gene from the further library of recombinant genes that confers optimized ability to produce the gene product relative to a previous recombinant gene;
-   (5) repeating (3) and (4), as necessary, until the further recombinant gene confers a desired level of optimized ability to express the gene product.

4.4.4. Method to Evolve a Biosensor

A further aspect of the invention is a method of evolving a biosensor for a compound A of interest, the method comprising:

-   (1) recombining at least first and second DNA segments from at least one gene conferring ability to detect a related compound B, the segments differing from each other in at least two nucleotides, to produce a library of recombinant genes;
-   (2) screening at least one recombinant gene from the library that confers optimized ability to detect compound A relative to a wild type form of the gene;
-   (3) recombining at least a segment from the at least one recombinant gene with a further DNA segment from the at least one gene, the same or different from the first and second segments, to produce a further library of recombinant genes;
-   (4) screening at least one further recombinant gene from the further library of recombinant genes that confers optimized ability to detect compound A relative to a previous recombinant gene;
-   (5) repeating (3) and (4), as necessary, until the further recombinant gene confers a desired level of optimized ability to detect compound A.

4.5. Fermentation of Micro-Organisms

The fermentation of microorganisms for the production of natural products is the oldest and most sophisticated application of biocatalysis. Industrial microorganisms effect the multistep conversion of renewable feedstocks to high value chemical products in a single reactor and in so doing catalyze a multi-billion dollar industry. Fermentation products range from fine and commodity chemicals such as ethanol, lactic acid, amino acids and vitamins, to high value small molecule pharmaceuticals, protein pharmaceuticals, and industrial enzymes. (See, e.g., McCoy (1998) C&EN 13-19, for an introduction to biocatalysis.)

The methods herein allow biocatalysts to be improved at a faster pace than conventional methods. Whole genome stochastic &/or non-stochastic mutagenesis can at least double the rate of strain improvement for microorganisms used in fermentation as compared to traditional methods. This provides for a relative decrease in the cost of fermentation processes. New products can enter the market sooner, producers can increase profits as well as market share, and consumers gain access to more products of higher quality and at lower prices. Further, increased efficiency of production processes translates to less waste production and more frugal use of resources. Whole genome stochastic &/or non-stochastic mutagenesis provides a means of accumulating multiple useful mutations per cycle and thus eliminates the inherent limitation of current strain improvement programs (SIPs).

One key to SIP is having an assay that can be dependably used to identify a few mutants out of thousands that have subtle increases in product yield. The limiting factor in many assay formats is the uniformity of cell growth. This variation is the source of baseline variability in subsequent assays. Inoculum size and culture environment (temperature/humidity) are sources of cell growth variation. Automation of all aspects of establishing initial cultures and state-of-the-art temperature and humidity controlled incubators are useful in reducing variability.

Mutant cells or spores are separated on solid media to produce individual sporulating colonies. Using an automated colony picker (Q-bot, Genetix, U.K.), colonies are identified, picked, and 10,000 different mutants inoculated into 96 well microtitre dishes containing two 3 mm glass balls/well. The Q-bot does not pick an entire colony but rather inserts a pin through the center of the colony and exits with a small sampling of cells (or mycelia) and spores. The time the pin is in the colony, the number of dips to inoculate the culture medium, and the time the pin is in that medium each affect inoculum size, and each can be controlled and optimized. The uniform process of the Q-bot decreases human handling error and increases the rate of establishing cultures (roughly 10,000/4 hours). These cultures are then shaken in a temperature and humidity controlled incubator. The glass balls act to promote uniform aeration of cells and the dispersal of mycelial fragments similar to the blades of a fermenter.

1. Prescreen

The ability to detect a subtle increase in the performance of a mutant over that of a parent strain relies on the sensitivity of the assay. The chance of finding the organisms having an improvement is increased by the number of individual mutants that can be screened by the assay. To increase the chances of identifying a pool of sufficient size, a prescreen that increases the number of mutants processed by 10-fold can be used. The goal of the primary screen will be to quickly identify mutants having equal or better product titres than the parent strain(s) and to move only these mutants forward to liquid cell culture. The primary screen is an agar plate screen that is analyzed by the Q-bot colony picker. Although assays can be fundamentally different, many result, e.g., in the production of colony halos. For example, antibiotic production is assayed on plates using an overlay of a sensitive indicator strain, such as B. subtilis. Antibiotic production is typically assayed as a zone of clearing (inhibited growth of the indicator organism) around the producing organism. Similarly, enzyme production can be assayed on plates containing the enzyme substrate, with activity being detected as a zone of substrate modification around the producing colony. Product titre is correlated with the ratio of halo area to colony area.

The Q-bot or other automated system is instructed to only pick colonies having a halo ratio in the top 10% of the population, i.e., 10,000 mutants from the 100,000 entering the plate prescreen. This increases the number of improved clones in the secondary assay and eliminates the wasted effort of screening knock-out and low producers. This improves the “hit rate” of the secondary assay.
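The top-decile picking rule described above can be sketched in a few lines. This is only an illustration under stated assumptions: colonies and halos are treated as circles measured by diameter, and the function names, the dictionary layout and the 10% default are hypothetical choices, not part of any colony-picker software.

```python
import math

def halo_ratio(halo_diameter_mm, colony_diameter_mm):
    """Ratio of halo area to colony area, assuming both zones are circular.

    For circles the area ratio equals the squared diameter ratio.
    """
    if colony_diameter_mm <= 0:
        raise ValueError("colony diameter must be positive")
    return (halo_diameter_mm / colony_diameter_mm) ** 2

def pick_top_fraction(colonies, fraction=0.10):
    """Return the ids of colonies whose halo ratio is in the top `fraction`.

    `colonies` maps a colony id to (halo_diameter_mm, colony_diameter_mm).
    At least one colony is always returned.
    """
    ranked = sorted(colonies, key=lambda cid: halo_ratio(*colonies[cid]), reverse=True)
    n_pick = max(1, math.floor(len(ranked) * fraction))
    return ranked[:n_pick]

# Hypothetical plate of 20 colonies with increasing halo diameters.
plate = {f"c{i}": (10 + i, 10) for i in range(20)}
picked = pick_top_fraction(plate)
```

With 100,000 colonies on the prescreen plates, the same rule would forward 10,000 colonies to the secondary assay.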

4.6. Experimental Applications

4.6.1 Stochastic &/or Non-Stochastic Mutagenesis

4.6.1.1 General Techniques

4.6.1.1.1 Starting Materials

Thus, a general method for recursive sequence recombination for the embodiments herein is to begin with a gene encoding an enzyme or enzyme subunit and to evolve that gene either for ability to act on a new substrate, or for enhanced catalytic properties with an old substrate, either alone or in combination with other genes in a multistep pathway. The term “gene” is used herein broadly to refer to any segment or sequence of DNA associated with a biological function. Genes can be obtained from a variety of sources, including cloning from a source of interest or synthesizing from known or predicted sequence information, and may include sequences designed to have desired parameters. The ability to use a new substrate can be assayed in some instances by the ability to grow on a substrate as a nutrient source. In other circumstances such ability can be assayed by decreased toxicity of a substrate for a host cell, hence allowing the host to grow in the presence of that substrate. Biosynthesis of new compounds, such as antibiotics, can be assayed similarly by growth of an indicator organism in the presence of the host expressing the evolved genes. For example, when an indicator organism is used in an overlay of the host expressing the evolved gene(s), wherein the indicator organism is sensitive or expected to be sensitive to the desired antibiotic, growth of the indicator organism would be inhibited in a zone around the host cell or colony expressing the evolved gene(s).

Another method of identifying new compounds is the use of standard analytical techniques such as mass spectroscopy, nuclear magnetic resonance, high performance liquid chromatography, etc. Recombinant microorganisms can be pooled and extracts or media supernatants assayed from these pools. Any positive pool can then be subdivided and the procedure repeated until the single positive is identified (“sib-selection”).
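The pooling logic of sib-selection can be sketched as repeated halving of a positive pool. The sketch below assumes a single positive clone and a pool assay that reports only positive or negative; the function name and the `pool_is_positive` callback are hypothetical stand-ins for the analytical assay.

```python
def sib_select(clones, pool_is_positive):
    """Narrow a pool of clones to one positive clone by repeated subdivision.

    `pool_is_positive(pool)` stands in for assaying a pooled extract and must
    return True when the pool contains the positive clone.
    Returns (positive_clone, rounds_of_subdivision), or (None, 0) if the
    starting pool is negative.
    """
    pool = list(clones)
    if not pool or not pool_is_positive(pool):
        return None, 0
    rounds = 0
    while len(pool) > 1:
        half = len(pool) // 2
        first, second = pool[:half], pool[half:]
        # Assay one half; if it is negative the positive must be in the other.
        pool = first if pool_is_positive(first) else second
        rounds += 1
    return pool[0], rounds

# Hypothetical library of 1000 clones in which clone 737 is the producer.
found, rounds = sib_select(range(1000), lambda pool: 737 in pool)
```

Halving is only one possible subdivision scheme; splitting each positive pool into more sub-pools trades fewer rounds for more assays per round.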

In some instances, the starting material for recursive sequence recombination is a discrete gene, cluster of genes, or family of genes known or thought to be associated with metabolism of a particular class of substrates. One of the advantages of the instant invention is that structural information is not required to estimate which parts of a sequence should be mutated to produce a functional hybrid enzyme.

In some embodiments of the invention, an initial screening of enzyme activities in a particular assay can be useful in identifying candidate enzymes as starting materials. For example, high throughput screening can be used to screen enzymes for dioxygenase type activities using aromatic acids as substrates. Dioxygenases typically transform indole-2-carboxylate and indole-3-carboxylate to colored products, including indigo (Eaton et al. J. Bacteriol. 177: 6983-6988 (1995)). DNA encoding enzymes that give some activity in the initial assay can then be recombined by the recursive techniques of the invention and rescreened. The use of such initial screening for candidate enzymes against a desired target molecule or analog of the target molecule can be especially useful to generate enzymes that catalyze reactions of interest such as catabolism of man-made pollutants.

This type of high throughput screening can also be used during each round of recursive sequence recombination to identify mutants that possess the highest level of the desired activity. For example, penicillin G acylases have been isolated by looking for clones that allow a leucine auxotroph to hydrolyse the penicillin G analogue phenylacetyl-L-leucine, thereby producing leucine and allowing cell growth (Martin, L. et al., FEMS Microbiology Lett. 125: 287-292 (1995)). Positives from this selection are then screened by a more labour-intensive method for ability to hydrolyse penicillin G.

This same selection on phenylacetyl-L-leucine can be used when evolving penicillin G acylase for greater activity by recursive sequence recombination. After each round of recombination the library of acylase genes is transformed into a leucine auxotroph. Those that grow fastest are picked as probably having the most active acylase. The acylases are then tested against the real substrate, penicillin G, by a more laborious screen such as HPLC. Thus, even if there is no convenient high throughput screen for an enzyme or a metabolic pathway, it is often possible to find a rapid detection method that can approximately measure the desired phenotype, thereby reducing the numbers of colonies that must be screened more accurately.

The starting material can also be a segment of such a gene or cluster that is recombined in isolation from its surrounding DNA, but is relinked to its surrounding DNA before screening/selection of recombination products. In other instances, the starting material for recombination is a larger segment of DNA that includes a coding sequence or other locus associated with metabolism of a particular substrate at an unknown location. For example, the starting material can be a chromosome, episome, YAC, cosmid, or phage P1 clone. In still other instances, the starting material is the whole genome of an organism that is known to have desirable metabolic properties, but for which no information localizing the genes associated with these characteristics is available.

In general any type of cells can be used as a recipient of evolved genes. Cells of particular interest include many bacterial cell types, both gram-negative and gram-positive, such as Rhodococcus, Streptomycetes, Actinomycetes, Corynebacteria, Penicillium, Bacillus, Escherichia coli, Pseudomonas, Salmonella, and Erwinia. Cells of interest also include eukaryotic cells, particularly mammalian cells (e.g., mouse, hamster, primate, human), both cell lines and primary cultures. Such cells include stem cells, including embryonic stem cells, zygotes, fibroblasts, lymphocytes, Chinese hamster ovary (CHO), mouse fibroblasts (NIHM), kidney, liver, muscle, and skin cells. Other eukaryotic cells of interest include plant cells, such as maize, rice, wheat, cotton, soybean, sugarcane, tobacco, and Arabidopsis; fish, algae, fungi (Penicillium, Fusarium, Aspergillus, Podospora, Neurospora), insects, and yeasts (Pichia and Saccharomyces).

The choice of host will depend on a number of factors related to the intended use of the engineered host, including pathogenicity, substrate range, environmental hardiness, presence of key intermediates, ease of genetic manipulation, and likelihood of promiscuous transfer of genetic information to other organisms. Particularly advantageous hosts are E. coli, lactobacilli, Streptomycetes, Actinomycetes and filamentous fungi.

The breeding procedure starts with at least two substrates, which generally show substantial sequence identity to each other (i.e., at least about 50%, 70%, 80% or 90% sequence identity) but differ from each other at certain positions. The difference can be any type of mutation, for example, substitutions, insertions and deletions. Often, different segments differ from each other in perhaps 5-20 positions. For recombination to generate increased diversity relative to the starting materials, the starting materials must differ from each other in at least two nucleotide positions. That is, if there are only two substrates, there should be at least two divergent positions. If there are three substrates, for example, one substrate can differ from the second at a single position, and the second can differ from the third at a different single position. The starting DNA segments can be natural variants of each other, for example, allelic or species variants. The segments can also be from nonallelic genes showing some degree of structural and usually functional relatedness (e.g., different genes within a superfamily such as the immunoglobulin superfamily). The starting DNA segments can also be induced variants of each other. For example, one DNA segment can be produced by error-prone PCR replication of the other, or by substitution of a mutagenic cassette. Induced mutants can also be prepared by propagating one (or both) of the segments in a mutagenic strain. In these situations, strictly speaking, the second DNA segment is not a single segment but a large family of related segments. The different segments forming the starting materials are often the same length or substantially the same length. However, this need not be the case; for example, one segment can be a subsequence of another. The segments can be present as part of larger molecules, such as vectors, or can be in isolated form.

The starting DNA segments are recombined by any of the recursive sequence recombination formats described above to generate a diverse library of recombinant DNA segments.
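The two requirements on the starting substrates, substantial sequence identity and at least two divergent positions overall, can be checked with a short sketch. It assumes the sequences are already aligned and of equal length (no insertions or deletions); the function names and the toy sequences are illustrative only.

```python
from itertools import combinations

def percent_identity(a, b):
    """Percentage of aligned positions at which two equal-length sequences match."""
    if len(a) != len(b):
        raise ValueError("sequences must be pre-aligned to equal length")
    matches = sum(x == y for x, y in zip(a, b))
    return 100.0 * matches / len(a)

def min_pairwise_identity(seqs):
    """Lowest percent identity found between any two of the substrates."""
    return min(percent_identity(a, b) for a, b in combinations(seqs, 2))

def divergent_positions(seqs):
    """Indices at which the aligned substrates are not all identical."""
    return [i for i, column in enumerate(zip(*seqs)) if len(set(column)) > 1]

def can_generate_diversity(seqs):
    """True when the set differs at two or more positions, so that
    recombination can yield sequences absent from the starting set."""
    return len(divergent_positions(seqs)) >= 2

# Three toy substrates: the second and third each differ from the first
# at a different single position, as in the three-substrate case above.
substrates = ["ACGTACGTAC", "ACGAACGTAC", "ACGTACGTTC"]
positions = divergent_positions(substrates)
```

Real substrates containing insertions or deletions would first need a proper alignment step, which this sketch does not attempt.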

Such a library can vary widely in size from having fewer than to more than 10⁵, 10⁷, or 10⁹ members. In general, the starting segments and the recombinant libraries generated include full-length coding sequences and any essential regulatory sequences, such as a promoter and polyadenylation sequence, required for expression. However, if this is not the case, the recombinant DNA segments in the library can be inserted into a common vector providing the missing sequences before performing screening/selection.

If the recursive sequence recombination format employed is an in vivo format, the library of recombinant DNA segments generated already exists in a cell, which is usually the cell type in which expression of the enzyme with altered substrate specificity is desired. If recursive sequence recombination is performed in vitro, the recombinant library is preferably introduced into the desired cell type before screening/selection. The members of the recombinant library can be linked to an episome or virus before introduction or can be introduced directly. In some embodiments of the invention, the library is amplified in a first host, and is then recovered from that host and introduced to a second host more amenable to expression, selection, or screening, or any other desirable parameter. The manner in which the library is introduced into the cell type depends on the DNA-uptake characteristics of the cell type, e.g., having viral receptors, being capable of conjugation, or being naturally competent. If the cell type is insusceptible to natural and chemical-induced competence, but susceptible to electroporation, one would usually employ electroporation. If the cell type is insusceptible to electroporation as well, one can employ biolistics. The biolistic PDS-1000 Gene Gun (Biorad, Hercules, Calif.) uses helium pressure to accelerate DNA-coated gold or tungsten microcarriers toward target cells.

The process is applicable to a wide range of tissues, including plants, bacteria, fungi, algae, intact animal tissues, tissue culture cells, and animal embryos. One can employ electronic pulse delivery, which is essentially a mild electroporation format for live tissues in animals and patients. Zhao, Advanced Drug Delivery Reviews 17: 257-262 (1995). After introduction of the library of recombinant DNA genes, the cells are optionally propagated to allow expression of genes to occur.

4.6.1.1.2 Selection and Screening

Screening is, in general, a two-step process in which one first determines which cells do and do not express a screening marker and then physically separates the cells having the desired property. Selection is a form of screening in which identification and physical separation are achieved simultaneously, for example, by expression of a selectable marker, which, in some genetic circumstances, allows cells expressing the marker to survive while other cells die (or vice versa). Screening markers include, for example, luciferase, β-galactosidase, and green fluorescent protein.

Screening can also be done by observing such aspects of growth as colony size, halo formation, etc. Additionally, screening for production of a desired compound, such as a therapeutic drug or “designer chemical”, can be accomplished by observing binding of cell products to a receptor or ligand, such as on a solid support or on a column. Such screening can additionally be accomplished by binding to antibodies, as in an ELISA. In some instances the screening process is preferably automated so as to allow screening of suitable numbers of colonies or cells. Some examples of automated screening devices include fluorescence activated cell sorting, especially in conjunction with cells immobilized in agarose (see Powell et al. Bio/Technology 8: 333-337 (1990); Weaver et al. Methods 2: 234-247 (1991)), automated ELISA assays, scintillation proximity assays (Hart, H. E. et al., Molecular Immunol. 16: 265-267 (1979)) and the formation of fluorescent, coloured or UV absorbing compounds on agar plates or in microtitre wells (Krawiec, S., Devel. Indust. Microbiology 31: 103-114 (1990)).

Selectable markers can include, for example, drug resistance, toxin resistance, or nutrient synthesis genes. Selection is also done by such techniques as growth on a toxic substrate to select for hosts having the ability to detoxify a substrate, growth on a new nutrient source to select for hosts having the ability to utilize that nutrient source, competitive growth in culture based on ability to utilize a nutrient source, etc.

In particular, uncloned but differentially expressed proteins (e.g., those induced in response to new compounds, such as biodegradable pollutants in the medium) can be screened by differential display (Appleyard et al. Mol. Gen. Genet. 247: 338-342 (1995)). Hopwood (Phil Trans R. Soc. Lond B 324: 549-562) provides a review of screens for antibiotic production. Omura (Microbio. Rev. 50: 259-279 (1986)) and Nisbet (Ann Rev. Med. Chem. 21: 149-157 (1986)) disclose screens for antimicrobial agents, including supersensitive bacteria, detection of beta-lactamase and D,D-carboxypeptidase inhibition, beta-lactamase induction, chromogenic substrates and monoclonal antibody screens.

Antibiotic targets can also be used as screening targets in high throughput screening. Antifungals are typically screened by inhibition of fungal growth. Pharmacological agents can be identified as enzyme inhibitors using plates containing the enzyme and a chromogenic substrate, or by automated receptor assays. Hydrolytic enzymes (e.g., proteases, amylases) can be screened by including the substrate in an agar plate and scoring for a hydrolytic clear zone or by using a colorimetric indicator (Steele et al. Ann. Rev. Microbiol. 45: 89-106 (1991)). This can be coupled with the use of stains to detect the effects of enzyme action (such as congo red to detect the extent of degradation of celluloses and hemicelluloses).

Tagged substrates can also be used. For example, lipases and esterases can be screened using different lengths of fatty acids linked to umbelliferyl. The action of lipases or esterases removes this tag from the fatty acid, resulting in a quenching or enhancement of umbelliferyl fluorescence. These enzymes can be screened in microtiter plates by a robotic device.

4.6.1.1.3 FACS

Fluorescence activated cell sorting (FACS) methods are also a powerful tool for selection/screening. In some instances a fluorescent molecule is made within a cell (e.g., green fluorescent protein). The cells producing the protein can simply be sorted by FACS. Gel microdrop technology allows screening of cells encapsulated in agarose microdrops (Weaver et al. Methods 2: 234-247 (1991)). In this technique products secreted by the cell (such as antibodies or antigens) are immobilized with the cell that generated them. Sorting and collection of the drops containing the desired product thus also collects the cells that made the product, and provides a ready source for the cloning of the genes encoding the desired functions. Desired products can be detected by incubating the encapsulated cells with fluorescent antibodies (Powell et al. Bio/Technology 8: 333-337 (1990)). FACS sorting can also be used by this technique to assay resistance to toxic compounds and antibiotics by selecting droplets that contain multiple cells (i.e., the product of continued division in the presence of a cytotoxic compound; Goguen et al. Nature 363: 189-190 (1995)). This method can select for any enzyme that can change the fluorescence of a substrate that can be immobilized in the agarose droplet.

4.6.1.1.4 Reporter Molecule

In some embodiments of the invention, screening can be accomplished by assaying reactivity with a reporter molecule reactive with a desired feature of, for example, a gene product. Thus, specific functionalities such as antigenic domains can be screened with antibodies specific for those determinants.

4.6.1.1.5 Cell-Cell Indicator

In other embodiments of the invention, screening is preferably done with a cell-cell indicator assay. In this assay format, separate library cells (Cell A, the cell being assayed) and reporter cells (Cell B, the assay cell) are used.

Only one component of the system, the library cells, is allowed to evolve. The screening is generally carried out in a two-dimensional immobilized format, such as on plates. The products of the metabolic pathways encoded by the evolved genes in the library cells (in this case, usually secondary metabolites such as antibiotics, polyketides, carotenoids, etc.) diffuse out of the library cell to the reporter cell. The product of the library cell may affect the reporter cell in one of a number of ways.

The assay system (indicator cell) can have a simple readout (e.g., green fluorescent protein, luciferase, β-galactosidase) which is induced by the library cell product but which does not affect the library cell. In these examples the desired product can be detected by colorimetric changes in the reporter cells adjacent to the library cell.

4.6.1.1.6 Feedback Mechanism

In other embodiments, indicator cells can in turn produce something that modifies the growth rate of the library cells via a feedback mechanism. Growth rate feedback can detect and accumulate very small differences. For example, if the library and reporter cells are competing for nutrients, library cells producing compounds to inhibit the growth of the reporter cells will have more available nutrients, and thus will have more opportunity for growth. This is a useful screen for antibiotics or a library of polyketide synthesis gene clusters where each of the library cells is expressing and exporting a different polyketide gene product.
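Why growth-rate feedback can accumulate very small differences can be illustrated with a simple calculation. The sketch assumes unconstrained exponential growth, with a reference cell line doubling once per generation and an advantaged line doubling slightly faster; the function name and the 2% figure are hypothetical illustrations, not measured values.

```python
def relative_enrichment(extra_doublings_per_generation, generations):
    """Fold change in the ratio of an advantaged line to a reference line.

    Assumes unconstrained exponential growth: the reference doubles once per
    generation and the advantaged line doubles (1 + extra) times per generation,
    so the ratio between them grows as 2 ** (extra * generations).
    """
    return 2.0 ** (extra_doublings_per_generation * generations)

# A hypothetical 2% growth-rate advantage followed over increasing culture time.
for generations in (10, 50, 100):
    print(generations, round(relative_enrichment(0.02, generations), 2))
```

Under these assumptions a 2% advantage, which is hard to measure directly, doubles the relative abundance of the advantaged cells after 50 generations and quadruples it after 100.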

4.6.1.1.7 Secretion

Another variation of this theme is that the reporter cell for an antibiotic selection can itself secrete a toxin or antibiotic that inhibits growth of the library cell. Production by the library cell of an antibiotic that is able to suppress growth of the reporter cell will thus allow uninhibited growth of the library cell.

Conversely, if the library is being screened for production of a compound that stimulates the growth of the reporter cell (for example, in improving chemical syntheses, the library cell may supply nutrients such as amino acids to an auxotrophic reporter, or growth factors to a growth-factor-dependent reporter), the reporter cell in turn should produce a compound that stimulates the growth of the library cell. Interleukins, growth factors, and nutrients are possibilities. Further possibilities include competition based on ability to kill surrounding cells, and positive feedback loops in which the desired product made by the evolved cell stimulates the indicator cell to produce a positive growth factor for cell A, thus indirectly selecting for increased product formation.

In some embodiments of the invention it can be advantageous to use a different organism (or genetic background) for screening than the one that will be used in the final product. For example, markers can be added to DNA constructs used for recursive sequence recombination to make the microorganism dependent on the constructs during the improvement process, even though those markers may be undesirable in the final recombinant microorganism.

Likewise, in some embodiments it is advantageous to use a different substrate for screening an evolved enzyme than the one that will be used in the final product. For example, Evnin et al. (Proc. Natl. Acad. Sci. U.S.A. 87: 6659-6663 (1990)) selected trypsin variants with altered substrate specificity by requiring that variant trypsin generate an essential amino acid for an arginine auxotroph by cleaving arginine-naphthylamide. This is thus a selection for arginine-specific trypsin, with the growth rate of the host being proportional to the enzyme activity.

The pool of cells surviving screening and/or selection is enriched for recombinant genes conferring the desired phenotype (e.g., altered substrate specificity, altered biosynthetic ability, etc.). Further enrichment can be obtained, if desired, by performing a second round of screening and/or selection without generating additional diversity.

The recombinant gene or pool of such genes surviving one round of screening/selection forms one or more of the substrates for a second round of recombination. Again, recombination can be performed in vivo or in vitro by any of the recursive sequence recombination formats described above.

If recursive sequence recombination is performed in vitro, the recombinant gene or genes to form the substrate for recombination should be extracted from the cells in which screening/selection was performed. Optionally, a subsequence of such gene or genes can be excised for more targeted subsequent recombination. If the recombinant gene(s) are contained within episomes, their isolation presents no difficulties. If the recombinant genes are chromosomally integrated, they can be isolated by amplification primed from known sequences flanking the regions in which recombination has occurred. Alternatively, whole genomic DNA can be isolated, optionally amplified, and used as the substrate for recombination. Small samples of genomic DNA can be amplified by whole genome amplification with degenerate primers (Barrett et al. Nucleic Acids Research 23: 3488-3492 (1995)). These primers result in a large number of random 3′ ends, which can undergo homologous recombination when reintroduced into cells.

If the second round of recombination is to be performed in vivo, as is often the case, it can be performed in the cell surviving screening/selection, or the recombinant genes can be transferred to another cell type (e.g., a cell type having a high frequency of mutation and/or recombination). In this situation, recombination can be effected by introducing additional DNA segment(s) into cells bearing the recombinant genes. In other methods, the cells can be induced to exchange genetic information with each other by, for example, electroporation. In some methods, the second round of recombination is performed by dividing a pool of cells surviving screening/selection in the first round into two subpopulations. DNA from one subpopulation is isolated and transfected into the other population, where the recombinant gene(s) from the two subpopulations recombine to form a further library of recombinant genes. In these methods, it is not necessary to isolate particular genes from the first subpopulation or to take steps to avoid random shearing of DNA during extraction. Rather, the whole genome of DNA sheared or otherwise cleaved into manageable sized fragments is transfected into the second subpopulation. This approach is particularly useful when several genes are being evolved simultaneously and/or the location and identity of such genes within the chromosome are not known.

The second round of recombination is sometimes performed exclusively among the recombinant molecules surviving selection. However, in other embodiments, additional substrates can be introduced. The additional substrates can be of the same form as the substrates used in the first round of recombination, i.e., additional natural or induced mutants of the gene or cluster of genes forming the substrates for the first round. Alternatively, the additional substrate(s) in the second round of recombination can be exactly the same as the substrate(s) in the first round of recombination.

After the second round of recombination, recombinant genes conferring the desired phenotype are again selected. The selection process proceeds essentially as before. If a suicide vector bearing a selective marker was used in the first round of selection, the same vector can be used again. Again, a cell or pool of cells surviving selection is selected. If a pool of cells is selected, the cells can be subjected to further enrichment.

4.6.1.2 General Methods

4.6.1.2.1 In Vitro

One format for recursive sequence recombination in vitro is illustrated herein. The initial substrates for recombination are a pool of related sequences. The X's show where the sequences diverge. The sequences can be DNA or RNA and can be of various lengths depending on the size of the gene or DNA fragment to be recombined or stochastic &/or non-stochastic mutagenized. Preferably the sequences are from 50 bp to 100 kb.

The pool of related substrates is converted into overlapping fragments, e.g., from about 5 bp to 5 kb or more, as shown herein. Often, the size of the fragments is from about 10 bp to 1000 bp, and sometimes the size of the DNA fragments is from about 100 bp to 500 bp. The conversion can be effected by a number of different methods, such as DNase I or RNase digestion, random shearing or partial restriction enzyme digestion.

Alternatively, the conversion of substrates to fragments can be effected by incomplete PCR amplification of substrates or PCR primed from a single primer. Alternatively, appropriate single-stranded fragments can be generated on a nucleic acid synthesizer. The concentration of nucleic acid fragments of a particular length and sequence is often less than 0.1% or 1% by weight of the total nucleic acid. The number of different specific nucleic acid fragments in the mixture is usually at least about 100, 500 or 1000.

The mixed population of nucleic acid fragments is converted to at least partially single-stranded form. Conversion can be effected by heating to about 80 °C to 100 °C, more preferably from 90 °C to 96 °C, to form single-stranded nucleic acid fragments and then reannealing. Conversion can also be effected by treatment with single-stranded DNA binding protein or recA protein. Single-stranded nucleic acid fragments having regions of sequence identity with other single-stranded nucleic acid fragments can then be reannealed by cooling to 4 °C to 75 °C, and preferably from 40 °C to 65 °C. Renaturation can be accelerated by the addition of polyethylene glycol (PEG), other volume-excluding reagents or salt. The salt concentration is preferably from 0 mM to 200 mM, more preferably from 10 mM to 100 mM. The salt may be KCl or NaCl. The concentration of PEG is preferably from 0% to 20%, more preferably from 5% to 10%. The fragments that reanneal can be from different substrates as shown herein. The annealed nucleic acid fragments are incubated in the presence of a nucleic acid polymerase, such as Taq or Klenow, or proofreading polymerases, such as Pfu or Pwo, and dNTPs (i.e., dATP, dCTP, dGTP and dTTP). If regions of sequence identity are large, Taq polymerase can be used with an annealing temperature of between 45-65 °C. If the areas of identity are small, Klenow polymerase can be used with an annealing temperature of between 20-30 °C (Stemmer, Proc. Natl. Acad. Sci. USA (1994), supra). The polymerase can be added to the random nucleic acid fragments prior to annealing, simultaneously with annealing or after annealing.

The process of denaturation, renaturation and incubation in the presence of polymerase of overlapping fragments to generate a collection of polynucleotides containing different permutations of fragments is sometimes referred to as stochastic &/or non-stochastic mutagenesis of the nucleic acid in vitro. This cycle is repeated for a desired number of times. Preferably the cycle is repeated from 2 to 100 times, more preferably from 10 to 40 times. The resulting nucleic acids are a family of double-stranded polynucleotides of from about 50 bp to about 100 kb, preferably from 500 bp to 50 kb, as shown herein. The population represents variants of the starting substrates showing substantial sequence identity thereto but also diverging at several positions. The population has many more members than the starting substrates. The population of fragments resulting from stochastic &/or non-stochastic mutagenesis is used to transform host cells, optionally after cloning into a vector.
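The statement that the recombined population has many more members than the starting substrates can be made concrete by counting the sequences reachable from a set of aligned parents. The sketch below is an idealized upper bound, assuming equal-length aligned parents and free recombination between every pair of divergent positions (closely spaced differences would in practice tend to be inherited together); the function name and the toy sequences are hypothetical.

```python
from itertools import product

def recombinant_space(parents):
    """All sequences that take, at each aligned position, a base found in
    at least one parent at that position.

    Idealized: assumes a crossover can occur between any two divergent
    positions, so every combination of parental bases is reachable.
    """
    if len({len(p) for p in parents}) != 1:
        raise ValueError("parents must be pre-aligned to equal length")
    columns = [sorted(set(column)) for column in zip(*parents)]
    return {"".join(choice) for choice in product(*columns)}

# Two toy parents that differ at three positions (indices 1, 3 and 5).
parents = ["AAAAAA", "ATAGAC"]
variants = recombinant_space(parents)
```

For two parents the count is 2 raised to the number of divergent positions, so the 5-20 differences typical of starting segments correspond to between 32 and roughly a million possible recombinants under these assumptions.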

4.6.1.2.1.1 Full Length Sequences

In a variation of in vitro stochastic &/or non-stochastic mutagenesis, subsequences of recombination substrates can be generated by amplifying the full-length sequences under conditions which produce a substantial fraction, typically at least 20 percent or more, of incompletely extended amplification products. The amplification products, including the incompletely extended amplification products, are denatured and subjected to at least one additional cycle of reannealing and amplification. This variation, wherein at least one cycle of reannealing and amplification provides a substantial fraction of incompletely extended products, is termed “stuttering.” In the subsequent amplification round, the incompletely extended products anneal to and prime extension on different sequence-related template species.

4.6.1.2.1.2 Overlapping Single Stranded DNA Fragments

In a further variation, at least one cycle of amplification can be conducted using a collection of overlapping single-stranded DNA fragments of related sequence and different lengths. Each fragment can hybridize to and prime polynucleotide chain extension of a second fragment from the collection, thus forming sequence-recombined polynucleotides. In a further variation, single-stranded DNA fragments of variable length can be generated from a single primer by Vent DNA polymerase on a first DNA template. The single-stranded DNA fragments are used as primers for a second, Kunkel-type template, consisting of a uracil-containing circular single-stranded DNA. This results in multiple substitutions of the first template into the second (see Levichkin et al. Mol. Biology 29: 572-577 (1995)).

4.6.1.2.1.3 Gene Clusters

Gene clusters such as those involved in polyketide synthesis (or indeed any multi-enzyme pathways catalyzing analogous metabolic reactions) can be recombined by recursive sequence recombination even if they lack DNA sequence homology. Homology can be introduced using synthetic oligonucleotides as PCR primers. In addition to the specific sequences for the gene being amplified, all of the primers used to amplify one type of enzyme (for example the acyl carrier protein in polyketide synthesis) are synthesized to contain an additional sequence of 20-40 bases 5′ to the gene (sequence A) and a different 20-40 base sequence 3′ to the gene (sequence B). The adjacent gene (in this case the keto-synthase) is amplified using a 5′ primer which contains the complementary strand of sequence B (sequence B′), and a 3′ primer containing a different 20-40 base sequence (C). Similarly, primers for the next adjacent gene (keto-reductases) contain sequences C′ (complementary to C) and D. If 5 different polyketide gene clusters are being stochastic &/or non-stochastic mutagenized, all five acyl carrier proteins are flanked by sequences A and B following their PCR amplification. In this way, small regions of homology are introduced, making the gene clusters into site specific recombination cassettes. Subsequent to the initial amplification of individual genes, the amplified genes can then be mixed and subjected to primerless PCR. Sequence B at the 3′ end of all of the five acyl carrier protein genes can anneal with and prime DNA synthesis from sequence B′ at the 5′ end of all five keto-synthase genes. In this way all possible combinations of genes within the cluster can be obtained. Oligonucleotides allow such recombinants to be obtained in the absence of sufficient sequence homology for recursive sequence recombination described above. Only homology of function is required to produce functional gene clusters.
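The flanking-tag scheme can be sketched at the level of amplified products. The sketch assumes each amplicon is written as its top strand, so that adjacent gene types share a junction tag whose reverse complement (B′) lies on the bottom strand of the downstream amplicon; the 20-base tag sequences, the toy gene sequences and the function names are all hypothetical placeholders.

```python
from itertools import product

COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def revcomp(seq):
    """Reverse complement of a DNA sequence, e.g. sequence B -> sequence B'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq))

def tagged_amplicon(gene, left_tag, right_tag):
    """Top strand of a PCR product whose primers added flanking tags."""
    return left_tag + gene + right_tag

def join_by_overlap(upstream, downstream, tag):
    """Join two amplicons through their shared junction tag, as primerless
    PCR would after the 3' tag of one anneals to its complement on the next."""
    if not upstream.endswith(tag) or not downstream.startswith(tag):
        raise ValueError("amplicons do not share the junction tag")
    return upstream + downstream[len(tag):]

# Hypothetical 20-base tags A, B and C.
TAG_A = "GATTACAGATTACAGATTAC"
TAG_B = "CCGGAATTCCGGAATTCCGG"
TAG_C = "TTGACCTTGACCTTGACCTT"

# Toy stand-ins for acyl carrier protein and keto-synthase genes
# taken from two different clusters.
acp_genes = {"acp1": "ATGAAA", "acp2": "ATGCCC"}
ks_genes = {"ks1": "ATGGGG", "ks2": "ATGTTT"}

assemblies = {
    (acp, ks): join_by_overlap(
        tagged_amplicon(acp_genes[acp], TAG_A, TAG_B),
        tagged_amplicon(ks_genes[ks], TAG_B, TAG_C),
        TAG_B,
    )
    for acp, ks in product(acp_genes, ks_genes)
}
```

Because every acyl carrier protein amplicon ends in the same tag and every keto-synthase amplicon begins with it, any upstream gene can pair with any downstream gene, which is what yields all combinations within the cluster.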

4.6.1.2.1.4 Multi Subunit Enzymes

This method is also useful for exploring permutations of any other multi-subunit enzymes. Dioxygenases are an example of enzymes composed of multiple polypeptides that have shown novel functions when the subunits are combined in novel ways. Directed recombination between the four protein subunits of biphenyl and toluene dioxygenases produced functional dioxygenases with increased activity against trichloroethylene (Furukawa et al., J. Bacteriol. 176: 2121-2123 (1994)). This combination of subunits from the two dioxygenases could also have been produced by cassette-stochastic &/or non-stochastic mutagenesis of the dioxygenases as described above, followed by selection for degradation of trichloroethylene.

In some polyketide synthases, the separate functions of the acyl carrier protein, keto-synthase, keto-reductase, etc. reside in a single polypeptide. In these cases domains within the single polypeptide may be stochastic &/or non-stochastic mutagenized, even if sufficient homology does not exist naturally, by introducing regions of homology as described above for entire genes. In this case, it may not be possible to introduce additional flanking sequences to the domains, due to the constraint of maintaining a continuous open reading frame.

Instead, groups of oligonucleotides are synthesized that are homologous to the 3′ end of the first domain encoded by one of the genes to be stochastic &/or non-stochastic mutagenized, and the 5′ ends of the second domains encoded by all of the other genes to be stochastic &/or non-stochastic mutagenized together. This is repeated with all domains, thus providing sequences that allow recombination between protein domains while maintaining their order.

4.6.1.2.1.5 Cassette-Based

The cassette-based recombination method can be combined with recursive sequence recombination by including gene fragments (generated by DNase, physical shearing, DNA stuttering, etc.) for one or more of the genes. Thus, in addition to different combinations of entire genes within a cluster (e.g., for polyketide synthesis), individual genes can be stochastic &/or non-stochastic mutagenized at the same time (e.g., all acyl carrier protein genes can also be provided as fragmented DNA), allowing a more thorough search of sequence space.

4.6.1.2.1.6 In Vitro Whole Genome Stochastic &/or Non-Stochastic Mutagenesis

The stochastic &/or non-stochastic mutagenesis of large DNA sequences, such as eukaryotic chromosomes, is difficult by prior art in vitro stochastic &/or non-stochastic mutagenesis methods. A method for overcoming this limitation is described herein.

The cells of related eukaryotic species are gently lysed and the intact chromosomes are liberated. The liberated chromosomes are then sorted by FACS or a similar method (such as pulsed-field electrophoresis), with chromosomes of similar size being sequestered together. Each size fraction of the sorted chromosomes generally will represent a pool of analogous chromosomes, for example the Y chromosome of related mammals. The goal is to isolate intact chromosomes that have not been irreversibly damaged.

The fragmentation and stochastic &/or non-stochastic mutagenesis of such large complex pieces of DNA employing DNA polymerases is difficult and would likely introduce an unacceptably high level of random mutations. An alternative approach that employs restriction enzymes and DNA ligase provides a feasible, less destructive solution. A chromosomal fraction is digested with one or more restriction enzymes that recognize long DNA sequences (about 15-20 bp), such as the intron- and intein-encoded endonucleases (I-Ppo I, I-Ceu I, PI-Psp I, PI-Tli I, PI-Sce I (VDE)).

These enzymes each cut, at most, a few times within each chromosome, resulting in a combinatorial mixture of large fragments, each having overhanging single-stranded termini that are complementary to other sites cleaved by the same enzyme.

The digest is further modified by very short incubation with a single-stranded exonuclease. The polarity of the nuclease chosen depends on the single-stranded overhang left by the restriction enzyme chosen: a 5′-3′ exonuclease for 3′ overhangs, and a 3′-5′ exonuclease for 5′ overhangs. This digestion results in significantly long regions of ssDNA overhang on each dsDNA terminus. The purpose of this incubation is to generate regions that define specific regions of DNA where recombination can occur. The fragments are then incubated under conditions where the ends of the fragments anneal with other fragments having homologous ssDNA termini. Often, the two fragments annealing will have originated from different chromosomes and, in the presence of DNA ligase, are covalently linked to form a chimeric chromosome. This generates genetic diversity mimicking the crossing over of homologous chromosomes. The complete ligation reaction will contain a combinatorial mixture of all possible ligations of fragments having homologous overhanging termini. A subset of this population will be complete chimeric chromosomes.
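As a rough illustration of the two choices described above, the following sketch looks up the exonuclease polarity for a given overhang and counts the complete chimeric chromosomes that could arise. The count rests on a simplifying assumption that is not stated in the text: every parental chromosome is linear and is cut at the same shared sites, so that each segment position can be filled from any parent. Both function names are hypothetical.

```python
def exonuclease_for(overhang_end: str) -> str:
    """Pick the exonuclease polarity that lengthens a 3' or 5' overhang."""
    choices = {"3'": "5'->3' exonuclease", "5'": "3'->5' exonuclease"}
    if overhang_end not in choices:
        raise ValueError("overhang_end must be \"3'\" or \"5'\"")
    return choices[overhang_end]


def complete_chimeras(n_chromosomes: int, n_shared_cut_sites: int) -> int:
    """Count ordered full-length chimeras of a linear chromosome.

    Assumes every parent is cut at the same sites, so each of the
    (sites + 1) segments can come from any parent.
    """
    return n_chromosomes ** (n_shared_cut_sites + 1)


print(exonuclease_for("3'"))    # 5'->3' exonuclease
print(complete_chimeras(4, 3))  # 256
```

With four analogous chromosomes and three shared cut sites, this idealized model gives 4^4 = 256 complete chimeras, including the four unrecombined parents.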

To screen the stochastic &/or non-stochastic mutagenized library, the chromosomes are delivered to a suitable host in a manner allowing for the uptake and expression of entire chromosomes. For example, YACs (yeast artificial chromosomes) can be delivered to eukaryotic cells by protoplast fusion.

Thus, the reassembled library could be encapsulated in liposomes and fused with protoplasts of the appropriate host cell. The resulting transformants would be propagated and screened for the desired cellular improvements. Once an improved population was identified, the chromosomes would be isolated, stochastic &/or non-stochastic mutagenized, and screened recursively.

4.6.1.2.1.7 Whole Genome Stochastic &/or Non-Stochastic Mutagenesis of Naturally Competent Microorganisms

Natural competence is a phenomenon observed for some microbial species whereby individual cells take up DNA from the environment and incorporate it into their genome by homologous recombination. Bacillus subtilis and Acinetobacter spp. are known to be particularly efficient at this process. A method for the whole genome stochastic &/or non-stochastic mutagenesis of these and analogous organisms employing this process is described.

One goal of whole genome stochastic &/or non-stochastic mutagenesis is the rapid accumulation of useful mutations from a population of individual strains into one superior strain. If the organisms to be evolved are naturally competent, then a split-pool strategy for the recursive transformation of naturally competent cells with DNA originating from the pool will effect this process. An example procedure is as follows.

A population of naturally competent organisms that demonstrates a variety of useful traits (such as increased protein secretion) is identified. The strains are pooled, and the pool is split. One half of the pool is used as a source of gDNA, while the other is used to generate a pool of naturally competent cells.

The competent cells are grown in the presence of the pooled gDNA to allow DNA uptake and recombination. Cells of one genotype take up and incorporate gDNA from cells of a different type, generating cells having chimeric genomes. The result is a population of cells representing a combinatorial mixture of the genetic variations originating in the original pool. These cells are pooled again and transformed with the same source of DNA again. This process is carried out recursively to increase the diversity of the genomes of cells resulting from transformation. Once sufficient diversity has been generated, the cell population is screened for new chimeric organisms demonstrating desired improvements.

This process is enhanced by increasing the natural competence of the host organism. COMS is a protein that, when expressed in B. subtilis, enhances the efficiency of natural competence mediated transformation by more than an order of magnitude.

It was demonstrated that approximately 100% of the cells harboring the plasmid pCOMS take up and recombine genomic DNA fragments into their genomes. In general, approximately 10% of the genome is recombined into any given transformed cell. This was demonstrated as follows.

A strain of B. subtilis pCOMS auxotrophic for two nutritional markers was transformed with genomic DNA (gDNA) isolated from a prototrophic strain of the same organism. 10% of the cells exposed to the DNA were prototrophic for one of the two nutrient markers. The average size of the DNA strand taken up by B. subtilis is approximately 50 kb, or about 2% of the genome. Thus, one of every ten cells had recombined a marker that was represented in one of every fifty molecules of gDNA taken up. Thus, most of the cells take up and recombine with approximately five 50 kb molecules, or 10% of the genome. This method represents a powerful tool for rapidly and efficiently recombining whole microbial genomes.
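The estimate of about five fragments per cell can be reproduced with a short calculation. The sketch below assumes that each fragment taken up independently carries the marker with probability 2%, and solves 1 − (1 − f)^n = p for n; the independence assumption and the function name `fragments_per_cell` are illustrative and are not taken from the text. The simpler ratio 10% / 2% = 5 gives nearly the same answer.

```python
import math


def fragments_per_cell(marker_frequency: float, fragment_fraction: float) -> float:
    """Solve 1 - (1 - f)**n = p for n, the fragments recombined per cell."""
    return math.log(1.0 - marker_frequency) / math.log(1.0 - fragment_fraction)


p_marker = 0.10    # cells prototrophic for a given marker (from the text)
f_fragment = 0.02  # one 50 kb fragment as a fraction of the genome (from the text)

n = fragments_per_cell(p_marker, f_fragment)
print(f"fragments per cell: {n:.1f}")              # fragments per cell: 5.2
print(f"genome recombined: {n * f_fragment:.0%}")  # genome recombined: 10%
```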

In the absence of pCOMS, only 0.3% of the cells prepared for natural competency take up and integrate a specific marker. This suggests that about 15% of the cells actually underwent recombination with a single genomic fragment. Thus, a recursive transformation strategy as described above produces a whole genome stochastic &/or non-stochastic mutagenized library, even in the absence of pCOMS. In the absence of pCOMS, however, the complex genomes will represent a smaller, but still screenable, percentage of the transformed or stochastic &/or non-stochastic mutagenized population.
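The 15% figure follows from dividing the marker frequency by the fraction of the genome carried on one fragment. The sketch below shows that arithmetic, assuming (as the text implies) that each recombining cell takes up a single 50 kb fragment covering about 2% of the genome; the function name `competent_fraction` is hypothetical.

```python
def competent_fraction(marker_frequency: float, fragment_fraction: float) -> float:
    """Fraction of cells that recombined one fragment, inferred from how
    often a specific marker is integrated and how much of the genome a
    single fragment covers."""
    return marker_frequency / fragment_fraction


frac = competent_fraction(marker_frequency=0.003, fragment_fraction=0.02)
print(f"{frac:.0%}")  # 15%
```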

4.6.1.2.1.8 Congression

Congression is the integration of two independent unlinked markers into a cell. 0.3% of naturally competent B. subtilis cells integrate a single marker (described above). Of these, about 10% have taken up an additional marker. Thus, if one selects or screens for the integration of one specific marker, 10% of the resulting population will have integrated another specific marker. This provides a way of enriching for specific integration events.

For example, if one is looking for the integration of a gene for which there is no easy screen or selection, it will exist as 0.3% of the cell population. If the population is first selected for a specific integration event, then the desired integration will be found in 10% of the population. This represents a significant (about 30-fold) enrichment for the desired event. This enrichment is defined as the “congression effect.” The congression effect is not influenced by the presence of pCOMS; thus the “pCOMS effect” is simply to increase the percentage of cells prepared for natural competence that are truly competent from about 15% in its absence to 100% in its presence. All competent cells still take up about the same amount of DNA, or about 10% of the Bacillus genome.
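The size of the congression effect can be checked directly from the two frequencies given above. The following sketch computes the fold enrichment and, as an added illustration, the approximate number of colonies that must be screened to expect one desired integrant; the function names are hypothetical.

```python
def congression_enrichment(unselected: float, after_selection: float) -> float:
    """Fold enrichment for an unselected marker after selecting another marker."""
    return after_selection / unselected


def colonies_per_hit(frequency: float) -> int:
    """Colonies to screen, on average, to find one desired integrant."""
    return round(1.0 / frequency)


fold = congression_enrichment(unselected=0.003, after_selection=0.10)
print(f"{fold:.0f}-fold")      # 33-fold
print(colonies_per_hit(0.003)) # 333
print(colonies_per_hit(0.10))  # 10
```

The computed value of roughly 33-fold agrees with the "about 30-fold" enrichment stated above.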

The congression effect can be used in the following examples to enhance whole genome stochastic &/or non-stochastic mutagenesis as well as the targeted integration of stochastic &/or non-stochastic mutagenized genes to the chromosome.

4.6.1.2.1.9 B. subtilis Stochastic &/or Non-Stochastic Mutagenesis

A population of B. subtilis cells having desired properties is identified, pooled and stochastic &/or non-stochastic mutagenized as described above, with one exception: once the pooled population is split, half of the population is transformed with an antibiotic selection marker that is flanked by sequence that targets its integration into, and disruption of, a specific nutritional gene, for example, one involved in amino acid biosynthesis. Transformants resistant to the drug are auxotrophic for that nutrient. The resistant population is pooled and grown under conditions rendering the cells naturally competent (or optionally first transformed with pCOMS).

The competent cells are then transformed with gDNA isolated from the original pool, and prototrophs are selected. The prototrophic population will have undergone recombination with genomic fragments encoding a functional copy of the nutritional marker, and thus will be enriched for cells having undergone recombination at other genetic loci by the congression effect.

4.6.1.2.1.10 Targeting of Genes and Gene Libraries to the Chromosome

It is useful to be able to efficiently deliver genes or gene libraries directly to a specific location in a cell's chromosome. As above, target cells are transformed with a positive selection marker flanked by sequences that target its homologous recombination into the chromosome. Selected cells harboring the marker are made naturally competent (with or without pCOMS, but preferably the former) and transformed with a mixture of two sets of DNA fragments. The first set contains a gene or a stochastic &/or non-stochastic mutagenized library of genes, each flanked with sequence to target its integration to a specific chromosomal locus. The second set contains a positive selection marker (different from that first integrated into the cells) flanked by sequence that will target its integration and replacement of the first positive selection marker.

Under optimal conditions, the mixture is such that the gene or gene library is in molar excess over the positive selection marker. Transformants are then selected for cells containing the new positive marker. These cells are enriched for cells having integrated a copy of the desired gene or gene library by the congression effect and can be directly screened for cells harboring the gene or gene variants of interest. This process was carried out using PCR fragments of less than 10 kb, and it was found that, employing the congression effect, a population can be enriched such that 50% of the cells are congressants. Thus, one in two cells contained a gene or gene variant.

Alternatively, the expression host can lack the first positive selection marker, and the competent cells are transformed with a mixture of the target genes and a limiting amount of the first positive selection marker fragment. Cells selected for the positive marker are screened for the desired properties in the targeted genes. The improved genes are amplified by PCR, stochastic &/or non-stochastic mutagenized again, and then returned to the original host, again with the first positive selection marker. This process is carried out recursively until the desired function of the genes is obtained. This process obviates the need to construct a primary host strain and the need for two positive markers.

4.6.1.2.1.11 Conjugation-Mediated Genetic Exchange

Conjugation can be employed in the evolution of cell genomes in several ways. Conjugative transfer of DNA occurs during contact between cells. See Guiney (1993) in: Bacterial Conjugation (Clewell, ed., Plenum Press, New York), pp. 75-104; Reimmann & Haas in Bacterial Conjugation (Clewell, ed., Plenum Press, New York 1993), at pp. 137-188 (incorporated by reference in their entirety for all purposes). Conjugation occurs between many types of gram negative bacteria, and some types of gram positive bacteria. Conjugative transfer is also known between bacteria and plant cells (Agrobacterium tumefaciens) or yeast. As discussed in U.S. Pat. No. 5,837,458, the genes responsible for conjugative transfer can themselves be evolved to expand the range of cell types (e.g., from bacteria to mammals) between which such transfer can occur.

Conjugative transfer is effected by an origin of transfer (oriT) and flanking genes (MOB A, B and C), and 15-25 genes, termed tra, encoding the structures and enzymes necessary for conjugation to occur. The transfer origin is defined as the site required in cis for DNA transfer. Tra genes include tra A, B, C, D, E, F, G, H, I, J, K, L, M, N, P, Q, R, S, T, U, V, W, X, Y, Z, virAB (alleles I-II), C, D, E, G, IHF, and FinOP. Tra genes can be expressed in cis or trans to oriT. Other cellular enzymes, including those of the RecBCD pathway, RecA, SSB protein, DNA gyrase, DNA pol I, and DNA ligase, are also involved in conjugative transfer. RecE or recF pathways can substitute for RecBCD.

One structural protein encoded by a tra gene is the sex pilus, a filament constructed of an aggregate of a single polypeptide protruding from the cell surface. The sex pilus binds to a polysaccharide on recipient cells and forms a conjugative bridge through which DNA can transfer. This process activates a site-specific nuclease encoded by a MOB gene, which specifically cleaves DNA to be transferred at oriT. The cleaved DNA is then threaded through the conjugation bridge by the action of other tra enzymes.

Mobilizable vectors can exist in episomal form or integrated into the chromosome. Episomal mobilizable vectors can be used to exchange fragments inserted into the vectors between cells. Integrated mobilizable vectors can be used to mobilize adjacent genes from the chromosome.

4.6.1.2.1.12 Use of Integrated Mobilized Vectors to Promote Exchange of Genomic DNA

The F plasmid of E. coli integrates into the chromosome at high frequency and mobilizes genes unidirectionally from the site of integration (Clewell, 1993, supra; Firth et al., in Escherichia coli and Salmonella Cellular and Molecular Biology 2, 2377-2401 (1996); Frost et al., Microbiol. Rev. 58, 162-210 (1994)). Other mobilizable vectors do not spontaneously integrate into a host chromosome at high efficiency, but can be induced to do so by growth under particular conditions (e.g., treatment with a mutagenic agent, growth at a nonpermissive temperature for plasmid replication). See Reimmann & Haas in Bacterial Conjugation (ed. Clewell, Plenum Press, NY 1993), Ch. 6. Of particular interest is the IncP group of conjugal plasmids, which are typified by their broad host range (Clewell, 1993, supra). Donor “male” bacteria which bear a chromosomal insertion of a conjugal plasmid, such as the E. coli F factor, can efficiently donate chromosomal DNA to recipient “female” enteric bacteria which lack F (F−). Conjugal transfer from donor to recipient is initiated at oriT. Transfer of the nicked single strand to the recipient occurs in a 5′ to 3′ direction by a rolling circle mechanism which allows mobilization of tandem chromosomal copies. Upon entering the recipient, the donor strand is discontinuously replicated. The linear, single-stranded donor DNA strand is a potent substrate for initiation of recA-mediated homologous recombination within the recipient. Recombination between the donor strand and recipient chromosomes can result in the inheritance of donor traits. Accordingly, strains which bear a chromosomal copy of F are designated Hfr (for high frequency of recombination) (Low, 1996 in Escherichia coli and Salmonella Cellular and Molecular Biology Vol. 2, pp. 2402-2405; Sanderson, in Escherichia coli and Salmonella Cellular and Molecular Biology 2, 2406-2412 (1996)).

The ability of strains with an integrated mobilizable vector to transfer chromosomal DNA provides a rapid and efficient means of exchanging genetic material between a population of bacteria, thereby allowing combination of positive mutations and dilution of negative mutations. Such stochastic &/or non-stochastic mutagenesis methods typically start with a population of strains with an integrated mobilizable vector encompassing at least some genetic diversity.

The genetic diversity can be the result of natural variation, exposure to a mutagenic agent or introduction of a fragment library. The population of cells is cultured without selection to allow genetic exchange, recombination and expression of recombinant genes. The cells are then screened or selected for evolution toward a desired property. The population surviving selection/screening can then be subject to a further round of stochastic &/or non-stochastic mutagenesis by Hfr-mediated genetic exchange, or otherwise.

The natural efficiency of Hfr and other strains with integrated mob vectors as recipients of conjugal transfer can be improved by several means. The relatively low recipient efficiency of natural Hfr strains is attributable to the products of the traS and traT genes of F (Clewell, 1993, supra; Firth et al., 1996, supra; Frost et al., 1994, supra; Achtman et al., J. Mol. Biol. 138, 779-795 (1980)). These products are localized to the inner and outer membranes of F+ strains, respectively, where they serve to inhibit redundant matings between two strains which are both capable of donating DNA. The effects of traS and traT, and cognate genes in other conjugal plasmids, can be eliminated by use of knockout cells incapable of expressing these enzymes, or reduced by propagating cells on a carbon-limited source (Peters et al., J. Bacteriol. 178, 3037-3043 (1996)).

In some methods, the starting population of cells has a mobilizable vector integrated at different genomic sites. Directional transfer from oriT typically results in more frequent inheritance of traits proximal to oriT. This is because mating pairs are fragile and tend to dissociate (particularly when in liquid medium), resulting in the interruption of transfer.

In a population of cells having a mobilizable vector integrated at different sites, chromosomal exchange occurs in a more random fashion. Kits of Hfr strains are available from the E. coli Genetic Stock Center and the Salmonella Genetic Stock Centre (Frost et al., 1994, supra).
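The effect of multiple integration sites on the transfer gradient can be illustrated with a toy model. The sketch below assumes that mating pairs dissociate at a constant rate, so the probability of transferring a marker falls off exponentially with its distance from oriT on the 100-minute E. coli map; the exponential form, the rate of 0.05 per minute, the clockwise direction of transfer, and the function names are all assumptions made for illustration and are not taken from the text.

```python
import math

MAP_LENGTH_MIN = 100.0  # length of the circular E. coli genetic map, in minutes


def transfer_probability(marker_min, orit_min, rate_per_min=0.05):
    """Chance that transfer reaches a marker before the mating pair separates.

    Assumes unidirectional (here clockwise) transfer from oriT and a constant
    dissociation rate; both are modeling assumptions.
    """
    distance = (marker_min - orit_min) % MAP_LENGTH_MIN
    return math.exp(-rate_per_min * distance)


def pooled_probability(marker_min, orit_sites, rate_per_min=0.05):
    """Average transfer probability over donors with different oriT sites."""
    total = sum(transfer_probability(marker_min, s, rate_per_min) for s in orit_sites)
    return total / len(orit_sites)


single = [0.0]
spread = [0.0, 25.0, 50.0, 75.0]
for sites in (single, spread):
    near = pooled_probability(10.0, sites)
    far = pooled_probability(90.0, sites)
    print(f"{len(sites)} oriT site(s): near/far ratio = {near / far:.1f}")
```

Under these assumptions, a single oriT favors the proximal marker by a factor of about 55, whereas four evenly spaced oriT sites reduce the bias to less than twofold, consistent with exchange occurring in a more random fashion.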

Alternatively, a library of strains with oriT at random sites and orientations can be produced by insertion mutagenesis using a transposon which bears oriT. The use of a transposon bearing an oriT [e.g., the Tn5-oriT described by Yakobson E A, et al. J. Bacteriol. 1984 October; 160(1): 451-453] provides a quick method of generating such a library. Transfer functions for mobilization from the transposon-borne oriT sites are provided by a helper vector in trans. It is possible to generate similar genetic constructs using other sequences known to one of skill as well.

In one aspect, a recursive scheme for genomic stochastic &/or non-stochastic mutagenesis using Tn-oriT elements is provided. A prototrophic bacterial strain or set of related strains bearing a conjugal plasmid, such as the F fertility factor or a member of the IncP group of broad host range plasmids, is mutagenized and screened for the desired properties. Individuals with the desired properties are mutagenized with a Tn-oriT element and screened for acquisition of an auxotrophy (e.g., by replica-plating to minimal and complete media) resulting from insertion of the Tn-oriT element in any one of many biosynthetic genes scattered across the genome. The resulting auxotrophs are pooled and allowed to mate under conditions promoting male-to-male matings, e.g., during growth in close proximity on a filter membrane. Note that transfer functions are provided by the helper conjugal plasmid present in the original strain set. Recombinant transconjugants are selected on minimal medium and screened for further improvement.

Optionally, strains bearing integrated mobilizable vectors are defective in mismatch repair gene(s). Inheritance of donor traits which arise from sequence heterologies increases in strains lacking the methyl-directed mismatch repair system. Optionally, the gene products which decrease recombination efficiency can be inhibited by small molecules.

Intergeneric conjugal transfer between species such as E. coli and Salmonella typhimurium, which are 20% divergent at the DNA level, is also possible if the recipient strain is mutH, mutL or mutS (see Rayssiguier et al., Nature 342, 396-401 (1989)). Such transfer can be used to obtain recombination at several points as shown by the following example.

One example uses an S. typhimurium Hfr donor strain having markers thr557 at map position 0, pyrF2690 at 33 min, serA13 at 62 min and hfrK5 at 43 min. The mutS+/−, F− E. coli recipient strains had markers pyrD68 at 21 min, aroC355 at 51 min, ilv3164 at 85 min and mutS215 at 59 min. The triauxotrophic S. typhimurium Hfr donor and isogenic mutS+/− triauxotrophic E. coli recipients were inoculated into 3 ml of LB broth and shaken at 37 °C until fully grown. 100 µl of the donor and each recipient were mixed in 10 ml fresh LB broth, and then deposited on a sterile Millipore 0.45 µm HA filter using a Nalgene 250 ml reusable filtration device. The donor and recipients alone were similarly diluted and deposited to check for reversion. The filters with cells were placed cell-side-up on the surface of an LB agar plate, which was incubated overnight at 37 °C. The filters were removed with the aid of sterile forceps and placed in a sterile 50 ml tube containing 5 ml of minimal salts broth. Vigorous vortexing was used to wash the cells from the filters. 100 µl of the mating mixtures, as well as donor and recipient controls, were spread on LB for viable cell counts and on minimal glucose supplemented with either two of the three recipient requirements for single recombinant counts, one of the three requirements for double recombinant counts, or none of the three requirements for triple recombinant counts. The plates were incubated for 48 hr at 37 °C, after which colonies were counted.
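The plating scheme at the end of this example can be enumerated mechanically. In the sketch below, each selective plate is described by the set of recipient requirements that it supplies; the requirements are labeled by the recipient markers named above, and the function name `selective_plates` is hypothetical.

```python
from itertools import combinations

# Auxotrophic markers of the E. coli recipient, as listed in the example.
RECIPIENT_MARKERS = ("pyrD68", "aroC355", "ilv3164")


def selective_plates(markers):
    """Map the number of repaired markers to the supplement sets to plate on.

    A plate supplying k of the n requirements supports growth only of cells
    in which the remaining n - k markers have been repaired by recombination.
    """
    plates = {}
    for n_repaired in range(1, len(markers) + 1):
        n_supplied = len(markers) - n_repaired
        plates[n_repaired] = [set(c) for c in combinations(markers, n_supplied)]
    return plates


plates = selective_plates(RECIPIENT_MARKERS)
for n_repaired, supplements in plates.items():
    print(n_repaired, "repaired:", len(supplements), "plate type(s)")
```

For three requirements this gives three plate types for single recombinants, three for double recombinants, and one unsupplemented plate for triple recombinants.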

Frequencies are further enhanced by increasing the ratio of donor to recipient cells, or by repeatedly mating the original donor strains with the previously generated recombinant progeny.

4.6.1.2.1.13 Introduction of Fragments by Conjugation

Mobilizable vectors can also be used to transfer fragment libraries into cells to be evolved. This approach is particularly useful in situations in which the cells to be evolved cannot be efficiently transformed directly with the fragment library but can undergo conjugation with primary cells that can be transformed with the fragment library. DNA fragments to be introduced into host cells encompass diversity relative to the host cell genome. The diversity can be the result of natural diversity or mutagenesis.

The DNA fragment library is cloned into a mobilizable vector having an origin of transfer. Some such vectors also contain mob genes, although alternatively these functions can also be provided in trans. The vector should be capable of efficient conjugal transfer between primary cells and the intended host cells. The vector should also confer a selectable phenotype. This phenotype can be the same as the phenotype being evolved or can be conferred by a marker, such as a drug resistance marker. The vector should preferably allow self-elimination in the intended host cells, thereby allowing selection for cells in which a cloned fragment has undergone genetic exchange with a homologous host segment rather than duplication. This can be achieved by use of a vector lacking an origin of replication functional in the intended host type or by inclusion of a negative selection marker in the vector.

One suitable vector is the broad host range conjugation plasmid described by Simon et al., Bio/Technology 1, 784-791 (1983); Trieu-Cuot et al., Gene 102, 99-104 (1991); Bierman et al., Gene 116, 43-49 (1992). These plasmids can be transformed into E. coli and then force-mated into bacteria that are difficult or impossible to transform by chemical or electrical induction of competence. These plasmids contain the origin of the IncP plasmid, oriT. Mobilization functions are supplied in trans by chromosomally-integrated copies of the necessary genes. Conjugal transfer of DNA can in some cases be assisted by treatment of the recipient (if gram-positive) with sub-inhibitory concentrations of penicillins (Trieu-Cuot et al., 1993 FEMS Microbiol. Lett. 109, 19-23). To increase diversity in populations, recursive conjugal mating prior to screening is performed.

Cells that have undergone allelic exchange with library fragments can be screened or selected for evolution toward a desired phenotype. Subsequent rounds of recombination can be performed by repeating the conjugal transfer step. The library of fragments can be fresh or can be obtained from some (but not all) of the cells surviving a previous round of selection/screening. Conjugation-mediated stochastic &/or non-stochastic mutagenesis can be combined with other methods of stochastic &/or non-stochastic mutagenesis.

4.6.1.2.1.14 Genetic Exchange Promoted by Transducing Phage in Cells Susceptible to Phage

Phage transduction can include the transfer, from one cell to another, of nonviral genetic material within a viral coat (Masters, in Escherichia coli and Salmonella Cellular and Molecular Biology 2, 2421-2442 (1996)). Perhaps the two best examples of generalized transducing phage are bacteriophages P1 and P22 of E. coli and S. typhimurium, respectively. Generalized transducing bacteriophage particles are formed at a low frequency during lytic infection when viral-genome-sized, double-stranded fragments of host (which serves as donor) chromosomal DNA are packaged into phage heads. Promiscuous high transducing (HT) mutants of bacteriophage P22 which efficiently package DNA with little sequence specificity have been isolated. Infection of a susceptible host results in a lysate in which up to 50% of the phage are transducing particles. Adsorption of the generalized transducing particle to a susceptible recipient cell results in the injection of the donor chromosomal fragment. RecA-mediated homologous recombination following injection of the donor fragment can result in the inheritance of donor traits. Another type of phage which achieves quasi-random insertion of DNA into the host chromosome is Mu. For an overview of Mu biology, see Groisman (1991) in Methods in Enzymology v. 204. Mu can generate a variety of chromosomal rearrangements including deletions, inversions, duplications and transpositions. In addition, elements which combine the features of P22 and Mu are available, including Mud-P22, which contains the ends of the Mu genome in place of the P22 att site and int gene. See, Berg, supra.

Generalized transducing phage can be used to exchange genetic material between a population of cells encompassing genetic diversity and susceptible to infection by the phage. Genetic diversity can be the result of natural variation between cells, induced mutation of cells or the introduction of fragment libraries into cells. DNA is then exchanged between cells by generalized transduction. If the phage does not cause lysis of cells, the entire population of cells can be propagated in the presence of phage. If the phage results in lytic infection, transduction is performed on a split pool basis. That is, the starting population of cells is divided into two. One subpopulation is used to prepare transducing phage. The transducing phage are then used to infect the other subpopulation. Preferably, infection is performed at high multiplicity of phage per cell so that few cells remain uninfected. Cells surviving infection are propagated and screened or selected for evolution toward a desired property. The pool of cells surviving screening/selection can then be stochastic &/or non-stochastic mutagenized by a further round of generalized transduction or by other stochastic &/or non-stochastic mutagenesis methods. Recursive split pool transduction is optionally performed prior to selection to increase the diversity of any population to be screened.
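The preference for a high multiplicity of infection can be made quantitative under a common simplifying assumption: if phage particles adsorb to cells independently, the number of particles per cell follows a Poisson distribution, and the fraction of cells left uninfected is e^(−MOI). The Poisson assumption and the function name `uninfected_fraction` are illustrative and are not taken from the text.

```python
import math


def uninfected_fraction(moi: float) -> float:
    """Poisson probability that a cell receives zero phage particles."""
    if moi < 0:
        raise ValueError("multiplicity of infection cannot be negative")
    return math.exp(-moi)


for moi in (0.5, 1, 5):
    print(f"MOI {moi}: {uninfected_fraction(moi):.1%} of cells uninfected")
```

Under this model, roughly 61% of cells escape infection at a multiplicity of 0.5, about 37% at a multiplicity of 1, and fewer than 1% at a multiplicity of 5.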

The efficiency of the above methods can be increased by reducing infection of cells by infectious (nontransducing) phage and by reducing lysogen formation. The former can be achieved by inclusion of chelators of divalent cations, such as citrate and EDTA, in culture media. Tail-defective transducing phages can be used to allow only a single round of infection.

Divalent cations are required for phage adsorption, and the inclusion of chelating agents therefore provides a means of preventing unwanted infection. Integration defective (int−) derivatives of generalized transducing phage can be used to prevent lysogen formation. In a further variation, host cells with defects in mismatch repair gene(s) can be used to increase recombination between transduced DNA and genomic DNA.

4.6.1.2.1.15 Use of Locked-In Prophages to Facilitate DNA Stochastic &/or Non-Stochastic Mutagenesis

The use of a hybrid, mobile genetic element (locked-in prophages) as a means to facilitate whole genome stochastic &/or non-stochastic mutagenesis of organisms using phage transduction as a means to transfer DNA from donor to recipient is a preferred embodiment. One such element (Mud-P22) based on the temperate Salmonella phage P22 has been described for use in genetic and physical mapping of mutations. See, Youderian et al. (1988) Genetics 118: 581-592, and Benson and Goldman (1992) J. Bacteriol. 174(5): 1673-1681. Individual Mud-P22 insertions package specific regions of the Salmonella chromosome into phage P22 particles.

Libraries of random Mud-P22 insertions can be readily isolated and induced to create pools of phage particles packaging random chromosomal DNA fragments. These phage particles can be used to infect new cells and transfer the DNA from the host into the recipient in the process of transduction. Alternatively, the packaged chromosomal DNA can be isolated and manipulated further by techniques such as DNA stochastic &/or non-stochastic mutagenesis or any other mutagenesis technique prior to being reintroduced into cells (especially recD cells for linear DNA) by transformation or electroporation, where they integrate into the chromosome. Either the intact transducing phage particles or isolated DNA can be subjected to a variety of mutagens prior to reintroduction into cells to enhance the mutation rate.

Mutator cell lines such as mutD can also be used for phage growth. Either method can be used recursively in a process to create genes or strains with desired properties. E. coli cells carrying a cosmid clone of Salmonella LPS genes are infectable by P22 phage. It is possible to develop similar genetic elements using other combinations of transposable elements and bacteriophages or viruses as well. P22 is a lambdoid phage that packages its DNA into preformed phage particles (heads) by a “headful” mechanism. Packaging of phage DNA is initiated at a specific site (pac) and proceeds unidirectionally along a linear, double stranded, normally concatemeric molecule. When the phage head is full (about 43 kb), the DNA strand is cleaved, and packaging of the next phage head is initiated. Locked-in or excision-defective P22 prophages, however, initiate packaging at their pac site, and then proceed unidirectionally along the chromosome, packaging successive headfuls of chromosomal DNA (rather than phage DNA). When these transducing phages infect new Salmonella cells they inject the chromosomal DNA from the original host into the recipient cell, where it can recombine into the chromosome by homologous recombination, creating a chimeric chromosome. Upon infection of recipient cells at a high multiplicity of infection, recombination can also occur between incoming transducing fragments prior to recombination into the chromosome.
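The headful mechanism described above can be sketched as a simple coordinate calculation. The following is a minimal illustration, assuming a circular chromosome, a fixed headful of exactly 43 kb, and packaging that continues without loss of processivity; real packaging series are less regular, and the function name and parameters are hypothetical.

```python
def headful_windows(pac_pos: int, chrom_len: int,
                    headful: int = 43_000, n_heads: int = 3):
    """Return (start, end) coordinates of successive headfuls.

    Packaging starts at pac_pos and proceeds in one direction along a
    circular chromosome of chrom_len base pairs. Simplifying assumptions
    (not from the source text): every head holds exactly `headful` bp
    and coordinates wrap around at the chromosome origin.
    """
    if chrom_len <= 0 or headful <= 0:
        raise ValueError("chrom_len and headful must be positive")
    windows = []
    start = pac_pos % chrom_len
    for _ in range(n_heads):
        end = (start + headful) % chrom_len
        windows.append((start, end))
        start = end  # the next head begins where the previous cut was made
    return windows


if __name__ == "__main__":
    # Hypothetical insertion at position 1,200,000 of a 4.8 Mb chromosome.
    for i, (s, e) in enumerate(headful_windows(1_200_000, 4_800_000), 1):
        print(f"headful {i}: {s}..{e}")
```

The same function can be called with headful=110_000 to approximate the larger fragments attributed to phage P1 elsewhere in this section.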

Integration of such locked-in P22 prophages at various sites in the chromosome allows flanking regions to be amplified and packaged into phage particles. The Mud-P22 mobile genetic element contains an excision-defective P22 prophage flanked by the ends of phage/transposon Mu. The entire Mud-P22 element can transpose to virtually any location in the chromosome or other episome (e.g., F′, BAC clone) when the Mu A and B proteins are provided in trans.

A number of embodiments for this type of genetic element are available. In one example, the locked in prophage are used as generalized transducing phage to transfer random fragments of a donor chromosome into a recipient. The Mud-P22 element acts as a transposon when Mu A and B transposase proteins are provided in trans, and integrates copies of itself at random locations in the chromosome. In this way, a library of random chromosomal Mud-P22 insertions can be generated in a suitable host. When the Mud-P22 prophages in this library are induced, random fragments of chromosomal DNA will be packaged into phage particles. When these phages infect recipient cells, the chromosomal DNA is injected and can recombine into the chromosome of the recipient. These recipient cells are screened for a desired property and cells showing improvement are then propagated.

The process can be repeated, since the Mud-P22 genetic element is not transferred to the recipient in this process. Infection at a high multiplicity allows for multiple chromosomal fragments to be injected and recombined into the recipient chromosome. Locked in prophages can also be used as specialized transducing phage.

Individual insertions near a gene of interest can be isolated from a random insertion library by a variety of methods. Induction of these specific prophages results in packaging of flanking chromosomal DNA including the gene(s) of interest into phage particles. Infection of recipient cells with these phages and recombination of the packaged DNA into the chromosome creates chimeric genes that can be screened for desired properties. Infection at a high multiplicity of infection can allow recombination between incoming transducing fragments prior to recombination into the chromosome.

These specialized transducing phage can also be used to isolate large quantities of high quality DNA containing specific genes of interest without any prior knowledge of the DNA sequence. Cloning of specific genes is not required. Insertion of such an element near a biosynthetic operon, for example, allows for large amounts of DNA from that operon to be isolated for use in DNA stochastic &/or non-stochastic mutagenesis (in vitro and/or in vivo), cloning, sequencing, or other uses as set forth herein. DNA isolated from similar insertions in other organisms containing homologous operons is optionally mixed for use in family stochastic &/or non-stochastic mutagenesis formats as described herein, in which homologous genes from different organisms (or different chromosomal locations within a single species, or both) are recombined. Alternatively, the transduced population is recursively transduced with pooled transducing phage or new transducing phage generated from the previously transduced cells. This can be carried out recursively to optimize the diversity of the genes prior to stochastic &/or non-stochastic mutagenesis.

Phage isolated from insertions in a variety of strains or organisms containing homologous operons are optionally mixed and used to coinfect cells at a high MOI, allowing for recombination between incoming transducing fragments prior to recombination into the chromosome.

Locked in prophage are useful for mapping of genes, operons, and/or specific mutations with either desirable or undesirable phenotypes. Locked-in prophages can also provide a means to separate and map multiple mutations in a given host. If one is looking for beneficial mutations outside a gene or operon of interest, then an unmodified gene or operon can be transduced into a mutagenized or stochastic &/or non-stochastic mutagenized host, which is then screened for the presence of desired secondary mutations. Alternatively, the gene/operon of interest can be readily moved from a mutagenized/stochastic &/or non-stochastic mutagenized host into a different background to screen/select for modifications in the gene/operon itself. It is also possible to develop similar genetic elements using other combinations of transposable elements and bacteriophages or viruses. Similar systems are set up in other organisms, e.g., those that do not allow replication of P22 or P1. Broad host range phages and transposable elements are especially useful. Similar genetic elements are derived from other temperate phages that also package by a headful mechanism. In general, these are the phages that are capable of generalized transduction. Viruses infecting eukaryotic cells may be adapted for similar purposes. Examples of generalized transducing phages that are useful are described in: Green et al., “Isolation and preliminary characterization of lytic and lysogenic phages with wide host range within the streptomycetes”, J Gen Microbiol 131(9): 2459-2465 (1985); Stuttard et al., “Genome structure in Streptomyces spp.: adjacent genes on the S. coelicolor A3(2) linkage map have cotransducible analogs in S. venezuelae”, J Bacteriol 169(8): 3814-3816 (1987); Wang et al., “High frequency generalized transduction by miniMu plasmid phage”, Genetics 116(2): 201-206 (1987); Welker, N. E., “Transduction in Bacillus stearothermophilus”, J Bacteriol 176(11): 3354-3359 (1988); Darzins et al., “Mini-D3112 bacteriophage transposable elements for genetic analysis of Pseudomonas aeruginosa”, J Bacteriol 171(7): 3909-3916 (1989); Hugouvieux-Cotte-Pattat et al., “Expanded linkage map of Erwinia chrysanthemi strain 3937”, Mol Microbiol 3(5): 573-581 (1989); Ichige et al., “Establishment of gene transfer systems for and construction of the genetic map of a marine Vibrio strain”, J Bacteriol 171(4): 1825-1834 (1989); Muramatsu et al., “Two generalized transducing phages in Vibrio parahaemolyticus and Vibrio alginolyticus”, Microbiol Immunol (12): 1073-1084 (1991); Regue et al., “A generalized transducing bacteriophage for Serratia marcescens”, Res Microbiol 42(1): 23-27 (1991); Kiesel et al., “Phage Acm1-mediated transduction in the facultatively methanol-utilizing Acetobacter methanolicus MB 58/4”, J Gen Virol 74(9): 1741-1745 (1993); Blahova et al., “Transduction of imipenem resistance by the phage F-116 from a nosocomial strain of Pseudomonas aeruginosa isolated in Slovakia”, Acta Virol 38(5): 247-250 (1994); Kidambi et al., “Evidence for phage-mediated gene transfer among Pseudomonas aeruginosa strains on the phylloplane”, Appl Environ Microbiol 60(2): 496-500 (1994); Weiss et al., “Isolation and characterization of a generalized transducing phage for Xanthomonas campestris pv. campestris”, J Bacteriol 176(11): 3354-3359 (1994); Matsumoto et al., “Clustering of the trp genes in Burkholderia (formerly Pseudomonas) cepacia”, FEMS Microbiol Lett 134(2-3): 265-271 (1995); Schicklmaier et al., “Frequency of generalized transducing phages in natural isolates of the Salmonella typhimurium complex”, Appl Environ Microbiol 61(4): 1637-1640 (1995); Humphrey et al., “Purification and characterization of VSH-1, a generalized transducing bacteriophage of Serpulina hyodysenteriae”, J Bacteriol 179(2): 323-329 (1997); Willi et al., “Transduction of antibiotic resistance markers among Actinobacillus actinomycetemcomitans strains by temperate bacteriophages Aa phi 23”, Cell Mol Life Sci 53(11-12): 904-910 (1997); Jensen et al., “Prevalence of broad-host-range lytic bacteriophages of Sphaerotilus natans, Escherichia coli, and Pseudomonas aeruginosa”, Appl Environ Microbiol 64(2): 575-580 (1998); and Nedelmann et al., “Generalized transduction for genetic linkage analysis and transfer of transposon insertions in different Staphylococcus epidermidis strains”, Zentralbl Bakteriol 287(1-2): 85-92 (1998).

A Mud-P1/Tn-P1 system comparable to Mud-P22 is developed using phage P1. Phage P1 has the advantage of packaging much larger (about 110 kb) fragments per headful. Phage P1 is currently used to create bacterial artificial chromosomes or BACs. P1-based BAC vectors are designed along these principles so that cloned DNA is packaged into phage particles, rather than the current system, which requires DNA preparation from single-copy episomes. This combines the advantages of both systems in having the genes cloned in a stable single-copy format, while allowing for amplification and specific packaging of cloned DNA upon induction of the prophage.

4.6.1.2.1.16 Random Placement of Genes or Improved Genes Throughout the Genome for Optimization of Gene Context

The placement and orientation of genes in a host chromosome (the “context” of the gene in a chromosome) or episome has large effects on gene expression and activity. Random integration of plasmid or other episomal sequences into a host chromosome by non-homologous recombination, followed by selection or screening for the desired phenotype, is a preferred way of identifying optimal chromosomal positions for expression of a target. This strategy is illustrated herein.

A variety of transposon mediated delivery systems can be employed to deliver genes of interest, either individual genes, genomic libraries, or a library of stochastic &/or non-stochastic mutagenized gene(s), randomly throughout the genome of a host. Thus, in one preferred embodiment, the improvement of a cellular function is achieved by cloning a gene of interest, for example a gene encoding a desired metabolic pathway, within a transposon delivery vehicle.

Such transposon vehicles are available for both Gram-negative and Gram-positive bacteria. De Lorenzo and Timmis (1994) Methods in Enzymology 235: 385-404 describe the analysis and construction of stable phenotypes in Gram-negative bacteria with Tn5- and Tn10-derived minitransposons. Kleckner et al. (1991) Methods in Enzymology 204, chapter 7 describe uses of transposons such as Tn10, including for use in Gram-positive bacteria. Petit et al. (1990) Journal of Bacteriology 172(12): 6736-6740 describe Tn10-derived transposons active in Bacillus subtilis. The transposon delivery vehicle is introduced into a cell population, which is then selected for recombinant cells that have incorporated the transposon into the genome.

The selection is typically by any of a variety of drug resistance markers also carried within the transposon. The selected subpopulation is screened for cells having improved expression of the gene(s) of interest. Once cells harboring the genes of interest in the optimal location are isolated, the genes are amplified from within the genome using PCR, stochastic &/or non-stochastic mutagenized, and cloned back into a similar transposon delivery vehicle which contains a different selection marker within the transposon and lacks the transposon integrase gene.

This stochastic &/or non-stochastic mutagenized library is then transformed back into the strain harboring the original transposon, and the cells are selected for the presence of the new resistance marker and the loss of the previous selection marker. Selected cells are enriched for those that have exchanged by homologous recombination the original transposon for the new transposon carrying members of the stochastic &/or non-stochastic mutagenized library. The surviving cells are then screened for further improvements in the expression of the desired phenotype. The genes from the improved cells are then amplified by PCR and stochastic &/or non-stochastic mutagenized again. This process is carried out recursively, oscillating each cycle between the different selection markers. Once the gene(s) of interest are optimized to a desired level, the fragment can be amplified and again randomly distributed throughout the genome as described above to identify the optimal location of the improved genes.

Alternatively, the gene(s) conferring a desired property may not be known. In this case the DNA fragments cloned within the transposon delivery vehicle could be a library of genomic fragments originating from a population of cells derived from one or more strains having the desired property(ies). The library is delivered to a population of cells derived from one or more strains having or lacking the desired property(ies) and cells incorporating the transposon are selected. The surviving cells are then screened for acquisition or improvement of the desired property. The fragments contained within the surviving cells are amplified by PCR and then cloned as a pool into a similar transposon delivery vector harboring a different selection marker from the first delivery vector. This library is then delivered to the pool of surviving cells, and the population having acquired the new selective marker is selected. The selected cells are then screened for further acquisition or improvement of the desired property.

In this way the different possible combinations of genes conferring or improving a desired phenotype are explored in a combinatorial fashion. This process is carried out repetitively with each new cycle employing an additional selection marker. Alternatively, PCR fragments are cloned into a pool of transposon vectors having different selective markers. These are delivered to cells and selected for 1, 2, 3, or more markers.

Alternatively, the amplified fragments from each improved cell are stochastic &/or non-stochastic mutagenized independently. The stochastic &/or non-stochastic mutagenized libraries are then cloned back into a transposon delivery vehicle similar to the original vector but containing a different selection marker and lacking the transposase gene. Selection is then for acquisition of the new marker and loss of the previous marker. Selected cells are enriched for those incorporating the stochastic &/or non-stochastic mutagenized variants of the amplified genes by homologous recombination. This process is carried out recursively, oscillating each cycle between the two selective markers.
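The oscillation between selective markers described above amounts to a simple alternating schedule: in each cycle, cells are selected for gain of the marker on the incoming vector and loss of the marker carried by the resident transposon. The sketch below only lays out that bookkeeping; the marker names are hypothetical placeholders and the function is illustrative rather than part of the described method.

```python
def marker_schedule(markers, cycles):
    """List (cycle, select_for, select_against) for recursive rounds.

    `markers` is an ordered collection of at least two selectable markers.
    Cycle 1 assumes the resident transposon carries markers[0], so the
    incoming vector carries markers[1]; subsequent cycles rotate through
    the collection, which reduces to simple alternation for two markers.
    """
    if len(markers) < 2:
        raise ValueError("at least two markers are needed to alternate")
    if cycles < 0:
        raise ValueError("cycles must be non-negative")
    schedule = []
    for i in range(cycles):
        resident = markers[i % len(markers)]
        incoming = markers[(i + 1) % len(markers)]
        schedule.append((i + 1, incoming, resident))
    return schedule


if __name__ == "__main__":
    # "kanR" and "camR" are placeholder marker names chosen for illustration.
    for cycle, gain, loss in marker_schedule(("kanR", "camR"), 4):
        print(f"cycle {cycle}: select for {gain}, screen for loss of {loss}")
```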

4.6.1.2.1.17 Improvement of Overexpressed Genes for a Desired Phenotype

The improvement of a cellular property or phenotype is often enhanced by increasing the copy number or expression of gene(s) participating in the expression of that property. Genes that have such an effect on a desired property can also be improved by DNA stochastic &/or non-stochastic mutagenesis to have a similar effect. A genomic DNA library is cloned into an overexpression vector and transformed into a target cell population such that the genomic fragments are highly expressed in cells selected for the presence of the overexpression vector. The selected cells are then screened for improvement of a desired property. The overexpression vectors from the improved cells are isolated and the cloned genomic fragments stochastic &/or non-stochastic mutagenized. The genomic fragment carried in the vector from each improved isolate is stochastic &/or non-stochastic mutagenized independently or with identified homologous genes (family stochastic &/or non-stochastic mutagenesis). The stochastic &/or non-stochastic mutagenized libraries are then delivered back to a population of cells and the selected transformants rescreened for further improvements in the desired property. This stochastic &/or non-stochastic mutagenesis/screening process is cycled recursively until the desired property has been optimized to the desired level. As stated above, gene dosage can greatly enhance a desired cellular property.

One method of increasing gene copy number of unknown genes is using a method of random amplification (see also, Mavingui et al. (1997) Nature Biotech. 15: 564). In this method, a genomic library is cloned into a suicide vector containing a selective marker that also at higher dosage provides an enhanced phenotype. An example of such a marker is the kanamycin resistance gene. At successively higher copy number, resistance to successively higher levels of kanamycin is achieved. The genomic library is delivered to a target cell by any of a variety of methods including transformation, transduction, conjugation, etc. Cells that have incorporated the vector into the chromosome by homologous recombination between the vector and chromosomal copies of the cloned genes can be selected by requiring expression of the selection marker under conditions where the vector does not replicate. This recombination event results in the duplication of the cloned DNA fragment in the host chromosome, with a copy of the vector and selection marker separating the two copies. The population of surviving cells is screened for improvement of a desired cellular property resulting from the gene duplication event. Further gene duplication events resulting in additional copies of the original cloned DNA fragments can be generated by further propagating the cells under successively more stringent selective conditions, i.e., increased concentrations of kanamycin. In this case selection requires increased copies of the selective marker, but an increase in copies of the desired gene fragment is concomitant. Surviving cells are further screened for an improvement in the desired phenotype. The resulting population of cells likely carries amplifications of different genes, since often many genes affect a given phenotype. To generate a library of the possible combinations of these genes, members of the original selected library showing phenotypic improvements are recombined, using the methods described herein, e.g., protoplast fusion, split pool transduction, transformation, conjugation, etc.

The recombined cells are selected for increased expression of the selective marker. Survivors are enriched for cells having incorporated additional copies of the vector sequence by homologous recombination, and these cells will be enriched for those having combined duplications of different genes. In other words, the duplication from one cell of enhanced phenotype becomes combined with the duplication of another cell of enhanced phenotype. These survivors are screened for further improvements in the desired phenotype. This procedure is repeated recursively until the desired level of phenotypic expression is achieved.

Alternatively, genes that have been identified or are suspected as being beneficial in increased copy number are cloned in tandem into appropriate plasmid vectors. These vectors are then transformed and propagated in an appropriate host organism. Plasmid-plasmid recombination between the cloned gene fragments results in further duplication of the genes. Resolution of the plasmid doublet can result in the uneven distribution of the gene copies, with some plasmids having additional gene copies and others having fewer gene copies. Cells carrying this distribution of plasmids are then screened for an improvement in the phenotype affected by the gene duplications.

In summary, a method of selecting for increased copy number of a nucleic acid sequence by the above procedure is provided. In the method, a genomic library in a suicide vector comprising a dose-sensitive selectable marker is provided, as noted above. The genomic library is transduced into a population of target cells. The target cells are selected for increasing doses of the selectable marker under conditions in which the suicide vector does not replicate episomally. A plurality of target cells are selected for the desired phenotype, recombined and reselected. The process is recursively repeated, if desired, until the desired phenotype is obtained.

4.6.1.2.1.18 Strategies for Improving Genomic Stochastic &/or Non-Stochastic Mutagenesis via Transformation of Linear DNA Fragments

Wild-type members of the Enterobacteriaceae (e.g., Escherichia coli) are typically resistant to genetic exchange following transformation of linear DNA molecules.

This is due, at least in part, to the Exonuclease V (Exo V) activity of the RecBCD holoenzyme, which rapidly degrades linear DNA molecules following transformation. Production of Exo V has been traced to the recD gene, which encodes the D subunit of the holoenzyme. As demonstrated by Russel et al. (1989) Journal of Bacteriology 2609-2613, homologous recombination between a transformed linear donor DNA molecule and the chromosome of the recipient is readily detected in strains bearing a loss-of-function mutation in recD.

The use of recD strains provides a simple means for genomic stochastic &/or non-stochastic mutagenesis of the Enterobacteriaceae. For example, a bacterial strain or set of related strains bearing a recD null mutation (e.g., the E. coli recD1903::mini-Tet allele) is mutagenized and screened for the desired properties. In a split-pool fashion, chromosomal DNA prepared from one aliquot could be used to transform (e.g., via electroporation or chemically induced competence) the second aliquot. The resulting transformants are then screened for improvement, or recursively transformed prior to screening.

The use of RecE/RecT, as described supra, can improve homologous recombination of linear DNA fragments. The RecBCD holoenzyme plays an important role in initiation of RecA-dependent homologous recombination. Upon recognizing a dsDNA end, the RecBCD enzyme unwinds and degrades the DNA asymmetrically in a 5′ to 3′ direction until it encounters a chi (or ‘χ’) site (consensus 5′-GCTGGTGG-3′), which attenuates the nuclease activity. This results in the generation of a ssDNA terminating near the χ site with a 3′-ssDNA tail that is preferred for RecA loading and subsequent invasion of dsDNA for homologous recombination. Accordingly, preprocessing of transforming fragments with a 5′ to 3′ specific ssDNA exonuclease, such as Lambda (λ) exonuclease (available, e.g., from Boehringer Mannheim), prior to transformation may serve to stimulate homologous recombination in a recD strain by providing a ssDNA invasive end for RecA loading and subsequent strand invasion.

The addition of DNA sequence encoding chi sites (consensus 5′-GCTGGTGG-3′) to DNA fragments can serve to both attenuate Exonuclease V activity and stimulate homologous recombination, thereby obviating the need for a recD mutation (see also, Kowalczykowski, et al. (1994) “Biochemistry of homologous recombination in Escherichia coli,” Microbiol. Rev. 58: 401-465 and Jessen, et al. (1998) “Modification of bacterial artificial chromosomes through Chi-stimulated homologous recombination and its application in zebrafish transgenesis.” Proc. Natl. Acad. Sci. 95: 5121-5126). Chi sites are optionally included in linkers ligated to the ends of transforming fragments or incorporated into the external primers used to generate DNA fragments to be transformed. The use of recombination-stimulatory sequences such as chi is a generally useful approach for evolution of a broad range of cell types by fragment transformation. Methods to inhibit or mutate analogs of Exo V or other nucleases (such as Exonucleases I (endA1), III (nth), IV (nfo), VII, and VIII of E. coli) are similarly useful.
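Incorporating chi sites into external primers, as described above, can be sketched as simple string handling. The example below prepends the consensus octamer 5′-GCTGGTGG-3′ to a primer and reports where the motif occurs on either strand of a fragment. It is a minimal illustration only: the helper names are hypothetical, the primer sequence is a made-up placeholder, and the sketch does not decide which orientation of the site is functional for a given fragment end, which would need to be checked separately.

```python
CHI = "GCTGGTGG"  # chi consensus, written 5' to 3', as given in the text

_COMPLEMENT = str.maketrans("ACGT", "TGCA")


def revcomp(seq: str) -> str:
    """Reverse complement of an uppercase DNA sequence."""
    return seq.upper().translate(_COMPLEMENT)[::-1]


def add_chi(primer: str) -> str:
    """Prepend the chi consensus to the 5' end of a primer."""
    return CHI + primer.upper()


def _find_all(seq: str, motif: str):
    """All start indices of motif in seq, including overlapping hits."""
    hits, i = [], seq.find(motif)
    while i != -1:
        hits.append(i)
        i = seq.find(motif, i + 1)
    return hits


def chi_positions(seq: str):
    """0-based positions where chi reads 5'->3' on the given strand
    ('forward') or on the complementary strand ('reverse')."""
    seq = seq.upper()
    return {
        "forward": _find_all(seq, CHI),
        "reverse": _find_all(seq, revcomp(CHI)),
    }


if __name__ == "__main__":
    primer = "ATGACCATGATTACG"  # placeholder primer, for illustration only
    print(add_chi(primer))
    print(chi_positions("AAGCTGGTGGTT"))
```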

Inhibition or elimination of nucleases, or modification of ends of transforming DNA fragments to render them resistant to exonuclease activity, has applications in evolution of a broad range of cell types.

4.6.1.2.1.19 Stochastic &/or Non-Stochastic Mutagenesis to Optimize Unknown Interactions

Many observed traits are the result of complex interactions of multiple genes or gene products. Most such interactions are still uncharacterized. Accordingly, it is often unclear which genes need to be optimized to achieve a desired trait, even if some of the genes contributing to the trait are known.

This lack of characterization is not an issue during DNA stochastic &/or non-stochastic mutagenesis, which produces solutions that optimize whatever is selected for. An alternative approach, which has the potential to solve not only this problem, but also anticipated future rate limiting factors, is complementation by overexpression of unknown genomic sequences.

A library of genomic DNA is first made as described, supra. This is transformed into the cell to be optimized and transformants are screened for increases in a desired property. Genomic fragments which result in an improved property are evolved by DNA stochastic &/or non-stochastic mutagenesis to further increase their beneficial effect. This approach requires no sequence information, nor any knowledge or assumptions about the nature of protein or pathway interactions, or even of what steps are rate-limiting; it relies only on detection of the desired phenotype. This sort of random cloning and subsequent evolution by DNA stochastic &/or non-stochastic mutagenesis of positively interacting genomic sequences is extremely powerful and generic. A variety of sources of genomic DNA are used, from isogenic strains to more distantly related species with potentially desirable properties. In addition, the technique is applicable to any cell for which the molecular biology basics of transformation and cloning vectors are available, and for any property which can be assayed (preferably in a high-throughput format). Alternatively, once optimized, the evolved DNA can be returned to the chromosome by homologous recombination or randomly by phage mediated site-specific recombination.

4.6.1.2.1.20 Homologous Recombination within the Chromosome

Homologous recombination within the chromosome is used to circumvent the limitations of plasmid based evolution and size restrictions. The strategy is similar to that described above for stochastic &/or non-stochastic mutagenesis of genes within their chromosomal context, except that no in vitro stochastic &/or non-stochastic mutagenesis occurs. Instead, the parent strain is treated with mutagens such as ultraviolet light or nitrosoguanidine, and improved mutants are selected. The improved mutants are pooled and split. Half of the pool is used to generate random genomic fragments for cloning into a homologous recombination vector. Additional genomic fragments are optionally derived from related species with desirable properties. The cloned genomic fragments are homologously recombined into the genomes of the remaining half of the mutant pool, and variants with improved properties are selected. These are subjected to a further round of mutagenesis, selection and recombination. Again, this process is entirely generic for the improvement of any whole cell biocatalyst for which a recombination vector and an assay can be developed. Here again, it should be noted that recombination can be performed recursively prior to screening.

4.6.1.2.1.21 Methods for Recursive Sequence Recombination

As shown herein, DNA stochastic &/or non-stochastic mutagenesis provides the most rapid technology for evolution of complex new functions. As shown herein, recombination in DNA stochastic &/or non-stochastic mutagenesis achieves accumulation of multiple beneficial mutations in a few cycles. In contrast, because of the high frequency of deleterious mutations relative to beneficial ones, iterative point mutation must build beneficial mutations one at a time, and consequently requires many cycles to reach the same point. As shown herein, rather than a simple linear sequence of mutation accumulation, DNA stochastic &/or non-stochastic mutagenesis is a parallel process where multiple problems may be solved independently, and then combined.

4.6.1.2.2 In Vivo Formats

4.6.1.2.2.1 Plasmid-Plasmid Recombination

The initial substrates for recombination are a collection of polynucleotides comprising variant forms of a gene. The variant forms usually show substantial sequence identity to each other sufficient to allow homologous recombination between substrates. The diversity between the polynucleotides can be natural (e.g., allelic or species variants), induced (e.g., error-prone PCR or error-prone recursive sequence recombination), or the result of in vitro recombination. Diversity can also result from resynthesizing genes encoding natural proteins with alternative codon usage. There should be at least sufficient diversity between substrates that recombination can generate more diverse products than there are starting materials. There must be at least two substrates differing in at least two positions.
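The requirement of at least two substrates differing in at least two positions can be made concrete by enumeration. For two aligned substrates that differ at k positions, free reassortment of those positions yields at most 2^k distinct sequences; with k = 1 that is only the two parents, whereas k = 2 already gives four. The sketch below assumes equal-length, pre-aligned sequences and unrestricted crossover between every pair of differing positions, which is an idealization; the function names and toy sequences are hypothetical.

```python
from itertools import product


def differing_positions(a: str, b: str) -> list[int]:
    """Indices at which two aligned, equal-length sequences differ."""
    if len(a) != len(b):
        raise ValueError("sequences must be aligned to equal length")
    return [i for i, (x, y) in enumerate(zip(a, b)) if x != y]


def percent_diversity(a: str, b: str) -> float:
    """Percentage of positions at which the two substrates differ."""
    return 100.0 * len(differing_positions(a, b)) / len(a)


def all_recombinants(a: str, b: str) -> set[str]:
    """Every sequence obtainable by taking each differing position
    from either parent (idealized, unrestricted recombination)."""
    sites = differing_positions(a, b)
    recombinants = set()
    for choice in product((a, b), repeat=len(sites)):
        seq = list(a)
        for site, parent in zip(sites, choice):
            seq[site] = parent[site]
        recombinants.add("".join(seq))
    return recombinants


if __name__ == "__main__":
    p1, p2 = "ACGTACGT", "ACCTACGA"  # toy substrates differing at 2 positions
    print(differing_positions(p1, p2), percent_diversity(p1, p2))
    print(sorted(all_recombinants(p1, p2)))
```

With a single differing position the set of recombinants contains only the two starting sequences, which is why one difference is not enough to generate new diversity.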

However, commonly a library of substrates of 10³-10⁸ members is employed. The degree of diversity depends on the length of the substrate being recombined and the extent of the functional change to be evolved. Diversity at 0.1-25% of positions is typical. The diverse substrates are incorporated into plasmids. The plasmids are often standard cloning vectors, e.g., bacterial multicopy plasmids. However, in some methods to be described below, the plasmids include mobilization (MOB) functions. The substrates can be incorporated into the same or different plasmids. Often at least two different types of plasmid having different types of selectable markers are used to allow selection for cells containing at least two types of vector. Also, where different types of plasmid are employed, the different plasmids can come from two distinct incompatibility groups to allow stable co-existence of two different plasmids within the cell. Nevertheless, plasmids from the same incompatibility group can still co-exist within the same cell for sufficient time to allow homologous recombination to occur.

Plasmids containing diverse substrates are initially introduced into cells by any method (e.g., chemical transformation, natural competence, electroporation, biolistics, packaging into phage or viral systems). Often, the plasmids are present at or near saturating concentration (with respect to maximum transfection capacity) to increase the probability of more than one plasmid entering the same cell. The plasmids containing the various substrates can be transfected simultaneously or in multiple rounds. For example, in the latter approach cells can be transfected with a first aliquot of plasmid, transfectants selected and propagated, and then transfected with a second aliquot of plasmid.

Having introduced the plasmids into cells, recombination between substrates to generate recombinant genes occurs within cells containing multiple different plasmids merely by propagating the cells. However, cells that receive only one plasmid are unable to participate in recombination and the potential contribution of substrates on such plasmids to evolution is not fully exploited (although these plasmids may contribute to some extent if they are propagated in mutator cells). The rate of evolution can be increased by allowing all substrates to participate in recombination. This can be achieved by subjecting transfected cells to electroporation. The conditions for electroporation are the same as those conventionally used for introducing exogenous DNA into cells (e.g., 1,000-2,500 volts, 400 μF and a 1-2 mm gap). Under these conditions, plasmids are exchanged between cells, allowing all substrates to participate in recombination.

In addition, the products of recombination can undergo further rounds of recombination with each other or with the original substrate. The rate of evolution can also be increased by use of conjugative transfer. To exploit conjugative transfer, substrates can be cloned into plasmids having MOB genes, and tra genes are also provided in cis or in trans to the MOB genes. The effect of conjugative transfer is very similar to electroporation in that it allows plasmids to move between cells and allows recombination between any substrate and the products of previous recombination to occur, merely by propagating the culture. The rate of evolution can also be increased by fusing cells to induce exchange of plasmids or chromosomes. Fusion can be induced by chemical agents, such as PEG, or viral proteins, such as influenza virus hemagglutinin, HSV-1 gB and gD. The rate of evolution can also be increased by use of mutator host cells (e.g., Mut L, S, D, T, H in bacteria and Ataxia telangiectasia human cell lines).

The time for which cells are propagated and recombination is allowed to occur, of course, varies with the cell type but is generally not critical, because even a small degree of recombination can substantially increase diversity relative to the starting materials. Cells bearing plasmids containing recombined genes are subject to screening or selection for a desired function. For example, if the substrate being evolved contains a drug resistance gene, one would select for drug resistance. Cells surviving screening or selection can be subjected to one or more rounds of screening/selection followed by recombination or can be subjected directly to an additional round of recombination. The next round of recombination can be achieved by several different formats independently of the previous round. For example, a further round of recombination can be effected simply by resuming the electroporation or conjugation-mediated intercellular transfer of plasmids described above.

Alternatively, a fresh substrate or substrates, the same or different from previous substrates, can be transfected into cells surviving selection/screening. Optionally, the new substrates are included in plasmid vectors bearing a different selective marker and/or from a different incompatibility group than the original plasmids. As a further alternative, cells surviving selection/screening can be subdivided into two subpopulations, and plasmid DNA from one subpopulation transfected into the other, where the substrates from the plasmids from the two subpopulations undergo a further round of recombination. In either of the latter two options, the rate of evolution can be increased by employing DNA extraction, electroporation, conjugation or mutator cells, as described above. In a still further variation, DNA from cells surviving screening/selection can be extracted and subjected to in vitro recursive sequence recombination. After the second round of recombination, a second round of screening/selection is performed, preferably under conditions of increased stringency. If desired, further rounds of recombination and selection/screening can be performed using the same strategy as for the second round.

With successive rounds of recombination and selection/screening, the surviving recombined substrates evolve toward acquisition of a desired phenotype. Typically, in this and other methods of recursive recombination, the final product of recombination that has acquired the desired phenotype differs from starting substrates at 0.1%-50% of positions and has evolved at a rate orders of magnitude in excess (e.g., by at least 10-fold, 100-fold, 1000-fold, or 10,000-fold) of the rate of naturally acquired mutation of about 1 mutation per 10⁹ positions per generation (see Anderson et al. Proc. Natl. Acad. Sci. U.S.A. 93:906-907 (1996)). The “final product” may be transferred to another host more desirable for utilization of the “stochastic &/or non-stochastic mutagenized” DNA.

This is particularly advantageous in situations where the more desirable host is less efficient as a host for the many cycles of mutation/recombination because it lacks the molecular biology or genetic tools that are available for other organisms such as E. coli.

4.6.1.2.2.2 Virus-Plasmid Recombination

The strategy used for plasmid-plasmid recombination can also be used for virus-plasmid recombination; usually, phage-plasmid recombination. However, some additional comments particular to the use of viruses are appropriate.

The initial substrates for recombination are cloned into both plasmid and viral vectors. It is usually not critical which substrate(s) are inserted into the viral vector and which into the plasmid, although usually the viral vector should contain different substrate(s) from the plasmid. As before, the plasmid (and the virus) typically contains a selective marker.

The plasmid and viral vectors can both be introduced into cells by transfection as described above. However, a more efficient procedure is to transfect the cells with plasmid, select transfectants and infect the transfectants with virus. Because the efficiency of infection of many viruses approaches 100% of cells, most cells transfected and infected by this route contain both a plasmid and virus bearing different substrates.

Homologous recombination occurs between plasmid and virus, generating both recombined plasmids and recombined virus. For some viruses, such as filamentous phage, in which intracellular DNA exists in both double-stranded and single-stranded forms, both can participate in recombination.

Provided that the virus is not one that rapidly kills cells, recombination can be augmented by use of electroporation or conjugation to transfer plasmids between cells. Recombination can also be augmented for some types of virus by allowing the progeny virus from one cell to reinfect other cells. For some types of virus, virus-infected cells show resistance to superinfection. However, such resistance can be overcome by infecting at high multiplicity and/or using mutant strains of the virus in which resistance to superinfection is reduced.

The result of infecting plasmid-containing cells with virus depends on the nature of the virus. Some viruses, such as filamentous phage, stably exist with a plasmid in the cell and also extrude progeny phage from the cell. Other viruses, such as lambda having a cosmid genome, stably exist in a cell like plasmids without producing progeny virions.

Other viruses, such as the T-phage and lytic lambda, undergo recombination with the plasmid but ultimately kill the host cell and destroy plasmid DNA. For viruses that infect cells without killing the host, cells containing recombinant plasmids and virus can be screened/selected using the same approach as for plasmid-plasmid recombination. Progeny virus extruded by cells surviving selection/screening can also be collected and used as substrates in subsequent rounds of recombination. For viruses that kill their host cells, recombinant genes resulting from recombination reside only in the progeny virus. If the screening or selective assay requires expression of recombinant genes in a cell, the recombinant genes should be transferred from the progeny virus to another vector, e.g., a plasmid vector, and retransfected into cells before selection/screening is performed.

For filamentous phage, the products of recombination are present in both cells surviving recombination and in phage extruded from these cells. The dual source of recombinant products provides some additional options relative to the plasmid-plasmid recombination. For example, DNA can be isolated from phage particles for use in a round of in vitro recombination. Alternatively, the progeny phage can be used to transfect or infect cells surviving a previous round of screening/selection, or fresh cells transfected with fresh substrates for recombination.

4.6.1.2.2.3 Virus-Virus Recombination

The principles described for plasmid-plasmid and plasmid-viral recombination can be applied to virus-virus recombination with a few modifications. The initial substrates for recombination are cloned into a viral vector. Usually, the same vector is used for all substrates.

Preferably, the virus is one that, naturally or as a result of mutation, does not kill cells. After insertion, some viral genomes can be packaged in vitro or using a packaging cell line. The packaged viruses are used to infect cells at high multiplicity such that there is a high probability that a cell will receive multiple viruses bearing different substrates.

After the initial round of infection, subsequent steps depend on the nature of infection as discussed in the previous section. For example, if the viruses have phagemid genomes such as lambda cosmids or M13, F1 or Fd phagemids, the phagemids behave as plasmids within the cell and undergo recombination simply by propagating the cells. Recombination is particularly efficient between single-stranded forms of intracellular DNA. Recombination can be augmented by electroporation of cells.

Following selection/screening, cosmids containing recombinant genes can be recovered from surviving cells, e.g., by heat induction of a cos-lysogenic host cell, or extraction of DNA by standard procedures, followed by repackaging cosmid DNA in vitro.

If the viruses are filamentous phage, recombination of replicating form DNA occurs by propagating the culture of infected cells. Selection/screening identifies colonies of cells containing viral vectors having recombinant genes with improved properties, together with phage extruded from such cells. Subsequent options are essentially the same as for plasmid-viral recombination.

4.6.1.2.2.4 Chromosome Recombination

This format can be used specifically to evolve chromosomal substrates. The format is particularly useful in situations in which many chromosomal genes contribute to a phenotype or one does not know the exact location of the chromosomal gene(s) to be evolved. The initial substrates for recombination are cloned into a plasmid vector. If the chromosomal gene(s) to be evolved are known, the substrates constitute a family of sequences showing a high degree of sequence identity but some divergence from the chromosomal gene. If the chromosomal genes to be evolved have not been located, the initial substrates usually constitute a library of DNA segments of which only a small number show sequence identity to the gene(s) to be evolved. Divergence between plasmid-borne substrate and the chromosomal gene(s) can be induced by mutagenesis or by obtaining the plasmid-borne substrates from a different species than that of the cells bearing the chromosome.

The plasmids bearing substrates for recombination are transfected into cells having chromosomal gene(s) to be evolved. Evolution can occur simply by propagating the culture, and can be accelerated by transferring plasmids between cells by conjugation or electroporation. Evolution can be further accelerated by use of mutator host cells or by seeding a culture of nonmutator host cells being evolved with mutator host cells and inducing intercellular transfer of plasmids by electroporation or conjugation. Preferably, mutator host cells used for seeding contain a negative selectable marker to facilitate isolation of a pure culture of the nonmutator cells being evolved. Selection/screening identifies cells bearing chromosomes and/or plasmids that have evolved toward acquisition of a desired function.

Subsequent rounds of recombination and selection/screening proceed in similar fashion to those described for plasmid-plasmid recombination. For example, further recombination can be effected by propagating cells surviving recombination in combination with electroporation or conjugative transfer of plasmids. Alternatively, plasmids bearing additional substrates for recombination can be introduced into the surviving cells. Preferably, such plasmids are from a different incompatibility group and bear a different selective marker than the original plasmids to allow selection for cells containing at least two different plasmids. As a further alternative, plasmid and/or chromosomal DNA can be isolated from a subpopulation of surviving cells and transfected into a second subpopulation. Chromosomal DNA can be cloned into a plasmid vector before transfection.

4.6.1.2.2.5 Virus-Chromosome Recombination

As in the other methods described above, the virus is usually one that does not kill the cells, and is often a phage or phagemid. The procedure is substantially the same as for plasmid-chromosome recombination. Substrates for recombination are cloned into the vector. Vectors including the substrates can then be transfected into cells or in vitro packaged and introduced into cells by infection. Viral genomes recombine with host chromosomes merely by propagating a culture. Evolution can be accelerated by allowing intercellular transfer of viral genomes by electroporation, or reinfection of cells by progeny virions. Screening/selection identifies cells having chromosomes and/or viral genomes that have evolved toward acquisition of a desired function.

There are several options for subsequent rounds of recombination. For example, viral genomes can be transferred between cells surviving selection/recombination by electroporation. Alternatively, viruses extruded from cells surviving selection/screening can be pooled and used to superinfect the cells at high multiplicity. Alternatively, fresh substrates for recombination can be introduced into the cells, either on plasmid or viral vectors.

4.6.1.2.2.6 Poolwise Whole Genome Recombination

Asexual evolution is a slow and inefficient process. Populations move as individuals rather than as a group. A diverse population is generated by mutagenesis of a single parent, resulting in a distribution of fit and unfit individuals. In the absence of a sexual cycle, each piece of genetic information for the surviving population remains in the individual mutants. Selection of the fittest results in many fit individuals being discarded, along with the genetically useful information they carry. Asexual evolution proceeds one genetic event at a time, and is thus limited by the intrinsic value of a single genetic event. Sexual evolution moves more quickly and efficiently. Mating within a population consolidates genetic information within the population and results in useful information being combined together.

The combining of useful genetic information results in progeny that are much more fit than their parents. Sexual evolution thus proceeds much faster by multiple genetic events. These differences are further illustrated herein. In contrast to sexual evolution, DNA stochastic &/or non-stochastic mutagenesis is the recursive mutagenesis, recombination, and selection of DNA sequences.

Sexual recombination in nature effects pairwise recombination and results in progeny that are genetic hybrids of two parents. In contrast, DNA stochastic &/or non-stochastic mutagenesis in vitro effects poolwise recombination, in which progeny are hybrids of multiple parental molecules. This is because DNA stochastic &/or non-stochastic mutagenesis effects many individual pairwise recombination events with each thermal cycle. After many cycles the result is a repetitively inbred population, with the “progeny” being the Fx (for X cycles of stochastic &/or non-stochastic mutagenesis) of the original parental molecules. These progeny are potentially descendants of many or all of the original parents. One can plot the potential number of mutations an individual can accumulate by sequential, pairwise, and poolwise recombination.
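The contrast among sequential, pairwise, and poolwise recombination can be made concrete with a small numerical sketch. The following Python fragment is illustrative only and rests on simplifying assumptions that are not taken from the text: the pool consists of single-mutant parents, sequential mutagenesis fixes one mutation per cycle, a pairwise cross can at best double the number of mutations carried by an individual, and a single poolwise cycle can in principle combine every mutation in the pool. The function name max_mutations is hypothetical.

```python
def max_mutations(pool_size, cycles):
    """Upper bound on mutations one individual can carry after `cycles` rounds.

    Hypothetical simplified model (assumptions, not from the source text):
    - sequential: one new mutation is fixed per cycle
    - pairwise: each cross can at best double the mutations carried,
      starting from single-mutant parents
    - poolwise: one cycle can combine every mutation present in the pool
    All values are capped at the number of distinct mutations in the pool.
    """
    if pool_size < 1 or cycles < 0:
        raise ValueError("pool_size must be >= 1 and cycles must be >= 0")
    sequential = min(pool_size, cycles)
    pairwise = min(pool_size, 2 ** cycles)
    poolwise = pool_size if cycles >= 1 else 1
    return {"sequential": sequential, "pairwise": pairwise, "poolwise": poolwise}


if __name__ == "__main__":
    # Tabulate the three upper bounds for a pool of 10 single mutants.
    for n in range(0, 5):
        print(n, max_mutations(10, n))
```

Under these assumptions the poolwise bound reaches the full pool size after one cycle, whereas the pairwise bound needs several doublings and the sequential bound grows by one per cycle.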

Poolwise recombination is an important feature of DNA stochastic &/or non-stochastic mutagenesis in that it provides a means of generating a greater proportion of the possible combinations of mutations from a single “breeding” experiment. In this way, the “genetic potential” of a population can be readily assessed by screening the progeny of a single DNA shuffling experiment.

For example, if a population consists of 10 single mutant parents, there are 2¹⁰ = 1024 possible combinations of those mutations, ranging from progeny having 0-10 mutations. Of these 1024, only 56 will result from a single pairwise cross (i.e., those having 0, 1, and 2 mutations). In nature the multiparent combinations will eventually arise after multiple random sexual matings, assuming no selection is imparted to remove some mutations from the population. In this way, sex effects the consolidation and sampling of all useful mutant combinations possible within a population. For the purposes of directed evolution, having the greatest number of mutant combinations entering a screen or selection is desirable so that the best progeny (i.e., according to the selection criteria used in the selection screen) is identified in the shortest possible time.
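The counts quoted for ten single-mutant parents follow from elementary combinatorics: each mutation is either present or absent, and a single pairwise cross can yield at most two of the ten mutations in one progeny. A minimal Python sketch of that arithmetic is given below; the function name combination_counts and its parameters are illustrative assumptions.

```python
from math import comb


def combination_counts(n_parents, max_per_cross=2):
    """Count mutation combinations for n single-mutant parents.

    Returns (total, reachable), where `total` is the number of all
    presence/absence combinations and `reachable` is the number carrying
    at most `max_per_cross` mutations, i.e. those obtainable from a
    single pairwise cross when max_per_cross is 2.
    """
    total = 2 ** n_parents
    reachable = sum(comb(n_parents, k) for k in range(max_per_cross + 1))
    return total, reachable


if __name__ == "__main__":
    total, pairwise = combination_counts(10)
    # 1 combination with 0 mutations, 10 with 1, and 45 with 2.
    print(total, pairwise)  # prints: 1024 56
```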

One challenge to in vivo and whole genome stochastic &/or non-stochastic mutagenesis is devising methods for effecting poolwise recombination or multiple repetitive pairwise recombination events. In crosses with a single pairwise cross per cycle before screening, the ability to screen the “genetic potential” of the starting population is limited. For this reason, the rate of in vivo and whole genome stochastic &/or non-stochastic mutagenesis mediated cellular evolution would be facilitated by effecting poolwise recombination. Two strategies for poolwise recombination are described below (protoplast fusion and transduction).

4.6.1.2.2.7 Protoplast Fusion

Protoplast fusion (discussed supra) mediated whole genome stochastic &/or non-stochastic mutagenesis is one format that can directly effect poolwise recombination. Whole genome stochastic &/or non-stochastic mutagenesis is the recursive recombination of whole genomes, in the form of one or more nucleic acid molecule(s) (fragments, chromosomes, episomes, etc.), from a population of organisms, resulting in the production of new organisms having distributed genetic information from at least two of the starting population of organisms. The process of protoplast fusion is further illustrated herein.

Progeny resulting from the fusion of multiple parent protoplasts have been observed (Hopwood & Wright, 1978); however, these progeny are rare (10⁻⁴-10⁻⁶). The low frequency is attributed to the distribution of fusants arising from two, three, four, etc. parents and the likelihood of the multiple recombination events (6 crossovers for a four parent cross) that would have to occur for multiparent progeny to arise. Thus, it is useful to enrich for the multiparent progeny. This can be accomplished, e.g., by repetitive fusion or enrichment for multiply fused protoplasts. The process of poolwise fusion and recombination is further illustrated herein.

4.6.1.2.2.8 Repetitive Fusion

Protoplasts of identified parental cells are prepared, fused and regenerated. Protoplasts of the regenerated progeny are then, without screening or enrichment, formed, fused and regenerated. This can be carried out for two, three, or more cycles before screening to increase the representation of multiparent progeny. The number of possible mutations/progeny doubles for each cycle. For example, if one cross produces predominantly progeny with 0, 1, and 2 mutations, a breeding of this population with itself will produce progeny with 0, 1, 2, 3, and 4 mutations, the third cross up to eight, etc. The representation of the multiparent progeny from these subsequent crosses will not be as high as the single and double parent progeny, but it will be detectable and much higher than from a single cross. The repetitive fusion prior to screening is analogous to many sexual crosses within a population, and the individual thermal cycles of in vitro DNA stochastic &/or non-stochastic mutagenesis described supra. A factor affecting the value of this approach is the starting size of the parental population. As the population grows, it becomes more likely that a multiparent fusion will arise from repetitive fusions. For example, if 4 parents are fused twice, the 4 parent progeny will make up approximately 0.2% of the total progeny. This is sufficient to find in a population of 3000 (95% confidence), but better representation is preferable. If ten parents are fused twice, >20% of the progeny will be four parent offspring.
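Whether a rare class of progeny is likely to appear in a screened population can be estimated with a simple sampling model. The Python sketch below assumes, as a simplification not stated in the text, that each screened progeny independently belongs to the class of interest with a fixed frequency; the function names are hypothetical. Other losses, such as incomplete regeneration, are not modeled.

```python
import math


def prob_at_least_one(freq, screened):
    """Probability that at least one of `screened` independent progeny
    belongs to a class occurring at frequency `freq` (assumed model)."""
    if not 0.0 < freq < 1.0:
        raise ValueError("freq must lie strictly between 0 and 1")
    return 1.0 - (1.0 - freq) ** screened


def min_screen_size(freq, confidence):
    """Smallest number of progeny to screen so that at least one member
    of the class is seen with the requested confidence."""
    if not 0.0 < freq < 1.0 or not 0.0 < confidence < 1.0:
        raise ValueError("freq and confidence must lie strictly between 0 and 1")
    return math.ceil(math.log(1.0 - confidence) / math.log(1.0 - freq))


if __name__ == "__main__":
    # Four-parent progeny at roughly 0.2% of the total, screened in 3000 progeny.
    print(prob_at_least_one(0.002, 3000))
    print(min_screen_size(0.002, 0.95))
```

Under this model a population of 3000 comfortably clears the 95% level for a class present at 0.2%, consistent with the statement that such progeny can be found at that population size.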

4.6.1.2.2.9 Enrichment for Multiple Fused Protoplasts

After the fusion of a population of protoplasts, the fusants are typically diluted into hypotonic medium, to dilute out the fusing agent (e.g., 50% PEG). The fused cells can be grown for a short period to regenerate cell walls, or processed directly, and are then separated on the basis of size. This is carried out, e.g., by cell sorting, using light dispersion as an estimate of size, to isolate the largest fusants. Alternatively, the fusants can be sorted by FACS on the basis of DNA content. The large fusants or those containing more DNA result from the fusion of multiple parents and are more likely to segregate to multiparent progeny. The enriched fusants are regenerated and screened directly, or the progeny are fused recursively as above to further enrich the population for diverse mutant combinations.

4.6.1.2.2.10 Transduction

Transduction can theoretically effect poolwise recombination, if the transducing phage particles contain predominantly host genomic DNA rather than phage DNA. If phage DNA is overly represented, then most cells will receive at least one undesired phage genome.

Phage particles generated from locked-in-prophage (supra) are useful for this purpose. A population of cells is infected with an appropriate transducing phage, and the lysate is collected and used to infect the same starting population. A high multiplicity of infection is employed to deliver multiple genomic fragments to each infected cell, thereby increasing the chance of producing recombinants containing mutations from more than two parent genomes.
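The benefit of a high multiplicity of infection can be illustrated with the Poisson model conventionally used for infection statistics. The Python sketch below assumes that the number of transducing particles received by a cell is Poisson distributed with a mean equal to the multiplicity of infection; this model and the function name prob_at_least are assumptions introduced for illustration, not part of the method as described.

```python
import math


def prob_at_least(k, moi):
    """Probability that a cell receives at least k particles, assuming the
    number received is Poisson distributed with mean `moi` (assumed model)."""
    if k < 0 or moi < 0:
        raise ValueError("k and moi must be non-negative")
    # Sum the Poisson probabilities of receiving 0, 1, ..., k-1 particles.
    below = sum(math.exp(-moi) * moi ** i / math.factorial(i) for i in range(k))
    return 1.0 - below


if __name__ == "__main__":
    # Fraction of cells receiving two or more fragments at several multiplicities.
    for moi in (0.1, 1.0, 5.0, 10.0):
        print(moi, prob_at_least(2, moi))
```

Under this assumption the fraction of cells receiving two or more fragments rises steeply with multiplicity, which is the rationale for infecting at high multiplicity when multi-parent recombinants are sought.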

The resulting transductants are recovered under conditions where phage cannot propagate, e.g., in the presence of citrate. This population is then screened directly or infected again with phage, with the resulting transducing particles being used to transduce the first progeny. This would mimic recursive protoplast fusion, multiple sexual recombination, and in vitro DNA stochastic &/or non-stochastic mutagenesis.

4.6.1.2.2.11 Methods for Whole Genome Stochastic &/or Non-Stochastic Mutagenesis by Blind Family Stochastic &/or Non-Stochastic Mutagenesis of Parsed Genomes and Recursive Cycles of Forced Integration and Excision by Homologous Recombination, and Screening for Improved Phenotypes

In vitro methods have been developed to reassemble single genes and operons, as set forth, e.g., herein. “Family” stochastic &/or non-stochastic mutagenesis of homologous genes within species and from different species is also an effective method for accelerating molecular evolution. This section describes additional methods for extending these methods such that they can be applied to whole genomes.

In some cases, the genes that encode rate limiting steps in a biochemical process, or that contribute to a phenotype of interest, are known. This method can be used to target family stochastic &/or non-stochastic mutagenized libraries to such loci, generating libraries of organisms with high quality family stochastic &/or non-stochastic mutagenized libraries of alleles at the locus of interest. An example of such an application would be the evolution of a host chaperonin to more efficiently chaperone the folding of an overexpressed protein in E. coli.

The goals of this process are to reassemble homologous genes from two or more species and to then integrate the stochastic &/or non-stochastic mutagenized genes into the chromosome of a target organism.

Integration of multiple stochastic &/or non-stochastic mutagenized genes at multiple loci can be achieved using recursive cycles of integration (generating duplications), excision (leaving the improved allele in the chromosome) and transfer of additional evolved genes by serially applying the same procedure.

In the first step, genes to be stochastic &/or non-stochastic mutagenized are subcloned into suitable bacterial vectors. These vectors can be plasmids, cosmids, BACs or the like. Thus, fragments from 100 bp to 100 kb can be handled. Homologous fragments are then “family stochastic &/or non-stochastic mutagenized” together (i.e., homologous fragments from different species or chromosomal locations are homologously recombined). As a simple case, homologs from two species (say, E. coli and Salmonella) are cloned, family stochastic &/or non-stochastic mutagenized in vitro and cloned into an allele replacement vector (e.g., a vector with a positively selectable marker, a negatively selectable marker and conditionally active origin of replication). The basic strategy for whole genome family stochastic &/or non-stochastic mutagenesis of parsed (subcloned) genomes is additionally set forth herein.

The vectors are transfected into E. coli and selected, e.g., for drug resistance. Most drug resistant cells should arise by homologous recombination between a family stochastic &/or non-stochastic mutagenized insert and a chromosomal copy of the cloned insert. Colonies with improved phenotype are screened (e.g., by mass spectroscopy for enzyme activity or small molecule production, or a chromogenic screen, or the like, depending on the phenotype to be assayed). Negative selection (i.e., suc selection) is imposed to force excision of the tandem duplication. Roughly half of the colonies should retain the improved phenotype. Importantly, this process regenerates a “clean” chromosome in which the wild type locus is replaced with a family stochastic &/or non-stochastic mutagenized fragment that encodes a beneficial allele. Since the chromosome is “clean” (i.e., has no vector sequences), other improved alleles can also be moved into this point on the chromosome by homologous recombination.

Selection or screening for improved phenotype can occur either after step 3 or step 4. If selection or screening takes place after step 3, then the improved allele can be conveniently moved to other strains by, for example, P1 transduction. One can then regenerate a strain containing the improved allele but lacking vector sequences by “negative selection” against the suc marker. In subsequent rounds, independently identified improved variants of the gene can be sequentially moved into the improved strain (e.g., by P1 transduction of the drug marked tandem duplication above). Transductants are screened for further improvement in phenotype by virtue of receiving the transduced tandem duplication, which itself contains the family stochastic &/or non-stochastic mutagenized genetic material. Negative selection is again imposed and the process of stochastic &/or non-stochastic mutagenesis of the improved strain is recursively repeated as desired.

Although this process was described with reference to targeting a gene or genes of interest, it can be used “blindly,” making no assumptions about which locus is to be targeted. This procedure is set forth herein. For example, the whole genome of an organism of interest is cloned into manageable fragments (e.g., 10 kb for plasmid-based methods). Homologous fragments are then isolated from related species. Forced recombination with chromosomal homologs creates chimeras.

4.6.1.2.2.12 Methods for High Throughput Family Stochastic &/or Non-Stochastic Mutagenesis of Genes

For E. coli, cloning the genome in 10 kb fragments requires about 300 clones. The homologous fragments are isolated, e.g., from Salmonella. This gives roughly three hundred pairs of homologous fragments. Each pair is family stochastic &/or non-stochastic mutagenized and the stochastic &/or non-stochastic mutagenized fragments are cloned into an allele replacement vector. The inserts are integrated into the E. coli genome as described above. A global screen is made to identify variants with an improved phenotype. This serves as the base collection of improvements that are to be stochastic &/or non-stochastic mutagenized to produce a desired strain. The stochastic &/or non-stochastic mutagenesis of these independently identified variants into one super strain is done as described above.

Family stochastic &/or non-stochastic mutagenesis has been shown to be an efficient method for creating high quality libraries of genetic variants. Given a cloned gene from one species, it is of interest to rapidly isolate homologs from other species, and this process can be rate limiting. For example, if one wants to perform family stochastic &/or non-stochastic mutagenesis on an entire genome, one may need to construct hundreds to thousands of individual family stochastic &/or non-stochastic mutagenized libraries.

In this embodiment, a gene of interest is optionally cloned into a vector in which ssDNA can be made. An example of such a vector is a phagemid vector with an M13 origin of replication. Genomic DNA or cDNA from a species of interest is isolated, denatured, annealed to the phagemid, and then enzymatically manipulated to clone it. The cloned DNA is then used to family reassemble with the original gene of interest. PCR based formats are also available. These formats require no intermediate cloning steps, and are, therefore, of particular interest for high throughput applications.

Alternatively, the gene of interest can be fished out using purified RecA protein. The gene of interest is PCR amplified using primers that are tagged with an affinity tag such as biotin, denatured, then coated with RecA protein (or an improved variant thereof). The coated ssDNA is then mixed with a gDNA plasmid library. Under the appropriate conditions, such as in the presence of non-hydrolyzable rATP analogs, RecA will catalyze the hybridization of the RecA coated gene (ssDNA) with its homologs in the plasmid library. The heteroduplex is then affinity purified from the non-hybridizing plasmids of the gene library by adsorption of the labeled PCR product and its associated homologous DNA to an appropriate affinity matrix.

The homologous DNA is used in a family stochastic &/or non-stochastic mutagenesis reaction for improvement of the desired function. Stochastic &/or non-stochastic mutagenesis of the E. coli chaperonin gene DnaJ with other homologs is described below as an example. The example can be generalized to any other gene, including eukaryotic genes such as plant or animal genes (including mammalian genes), by following the format described.

As a first step, the E. coli DnaJ gene is cloned into an M13 phagemid vector. ssDNA is then produced, preferably in a dut(−) ung(−) strain so that Kunkel site directed mutagenesis protocols can be applied. Genomic DNA is then isolated from a non-E. coli source, such as Salmonella and Yersinia pestis. The bacterial genomic DNAs are denatured and reannealed to the phagemid ssDNA (e.g., about 1 microgram of ssDNA). The reannealed product is treated with an enzyme such as mung bean nuclease that degrades ssDNA as an exonuclease but not as an endonuclease (the nuclease does not degrade mismatched DNA that is embedded in a larger annealed fragment). The standard Kunkel site directed mutagenesis protocol is used to extend the fragment and the target cells are transformed with the resulting mutagenized DNA.

In a first variation on the above, the procedure is adapted to the situation where the target gene or genes of interest are unknown. In this variation, the whole genome of the organism of interest is cloned in fragments (e.g., of about 10 kb each) into a phagemid. Single stranded phagemid DNA is then produced. Genomic DNA from the related species is denatured and annealed to the phagemids. Mung bean nuclease is used to trim away unhybridized DNA ends. Polymerase plus ligase is used to fill in the resulting gapped circles.

These clones are transformed into a mismatch repair deficient strain. When the mismatched molecules are replicated in the bacteria, most colonies contain both the E. coli and the homologous fragment. The two homologous genes are then isolated from the colonies (e.g., either by standard plasmid purification or colony PCR) and subjected to stochastic &/or non-stochastic mutagenesis.

Another approach to generating chimeras that requires no in vitro stochastic &/or non-stochastic mutagenesis is simply to clone the Salmonella genome into an allele replacement vector, transform E. coli, and select for chromosomal integrants. Homologous recombination between Salmonella genes and E. coli homologs generates stochastic &/or non-stochastic mutagenized chimeras. A global screen is done to screen for improved phenotypes. Alternatively, recursive transformation and recombination is performed to increase diversity prior to screening. If colonies with improved phenotypes are obtained, it is verified that the improvement is due to allele replacement by P1 transduction into a fresh strain and counterscreening for the improved phenotype. A collection of such improved alleles can then be combined into one strain using the methods for whole genome stochastic &/or non-stochastic mutagenesis by blind family stochastic &/or non-stochastic mutagenesis of parsed genomes as set forth herein. Additionally, once these loci are identified, it is likely that further rounds of stochastic &/or non-stochastic mutagenesis and screening will yield further improvements. This could be done by cloning the chimeric gene and then using the methods described in this disclosure to breed the gene with homologs from many different strains of bacteria.

In general, the transformants contain clones of the homologue of the target gene (e.g., E. coli DnaJ in the example above). Mismatch repair in vivo results in a decrease in diversity of the gene. There are at least two solutions to this. First, transduction can be performed into a mismatch repair deficient strain. Alternatively or in addition, the M13 template DNA can be selectively degraded, leaving the cloned homologue. This can be done using methods similar to the standard Eckstein site directed mutagenesis technique. General texts which describe general molecular biological techniques useful herein, including mutagenesis, include Sambrook et al., Molecular Cloning: A Laboratory Manual (2nd Ed.), Vol. 1-3, Cold Spring Harbor Laboratory, Cold Spring Harbor, N.Y., 1989 (“Sambrook”) and Current Protocols in Molecular Biology, F. M. Ausubel et al., eds., Current Protocols, a joint venture between Greene Publishing Associates, Inc. and John Wiley & Sons, Inc., (supplemented through 1998) (“Ausubel”).

This method relies on incorporation of alpha thiol modified dNTPs during synthesis of the new strand followed by selective degradation of the template and resynthesis of the template strand. In one embodiment, the template strand is grown in a dut(−) ung(−) strain so that uracil is incorporated into the phagemid DNA. After extension as noted above (and before transformation) the DNA is treated with uracil glycosylase and an apurinic site endonuclease such as Endo III or Endo IV. The treated DNA is then treated with a processive exonuclease that resects from the resulting gaps while leaving the other strand intact (as in Eckstein mutagenesis). The DNA is polymerized and ligated. Target cells are then transformed. This process enriches for clones encoding the homologue which is not derived from the target (i.e., in the example above, the non-E. coli homologue).

An analogous procedure is optionally performed in a PCR format. As applied to the DnaJ illustration above, DnaJ DNA is amplified by PCR with primers that build 30-mer priming sites on each end. The PCR product is denatured and annealed with an excess of Salmonella genomic DNA. The Salmonella DnaJ gene hybridizes with the E. coli homologue. After treatment with Mung Bean nuclease, the resulting mismatched hybrid is PCR amplified with the flanking 30-mer primers. This PCR product can be used directly for family stochastic &/or non-stochastic mutagenesis. As genomics provides an increasing amount of sequence information, it is increasingly possible to directly PCR amplify homologs with designed primers. For example, given the sequence of the E. coli genome and of a related genome (e.g., Salmonella), each genome can be PCR amplified with designed primers in, e.g., 5 kb fragments. The homologous fragments can be put together in a pairwise fashion for stochastic &/or non-stochastic mutagenesis. For genome stochastic &/or non-stochastic mutagenesis, the stochastic &/or non-stochastic mutagenized products are cloned into the allele replacement vector and bred into the genome as described supra.
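By way of illustration only, the division of a sequenced genome into fragments of about 5 kb for PCR amplification can be sketched as follows. The function name `tile_genome` and the half-open coordinate convention are assumptions made for this sketch; primer design and the matching of homologous fragments between genomes are not shown.

```python
def tile_genome(genome_length, fragment_size=5000):
    """Return (start, end) half-open coordinates that cover the genome end to end.

    Hypothetical helper: fragment_size defaults to the 5 kb example in the text.
    The last fragment is shorter when the genome length is not a multiple of
    fragment_size.
    """
    if genome_length <= 0 or fragment_size <= 0:
        raise ValueError("lengths must be positive")
    return [
        (start, min(start + fragment_size, genome_length))
        for start in range(0, genome_length, fragment_size)
    ]


if __name__ == "__main__":
    # Illustrative 12 kb region split into 5 kb amplicons.
    for start, end in tile_genome(12_000):
        print(start, end)
```

Each coordinate pair would correspond to one amplicon; fragments from the two genomes that cover homologous regions would then be paired for the mutagenesis reaction.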

4.6.1.2.2.13 Hyper-Recombinogenic RecA Clones

The invention further provides hyper-recombinogenic RecA proteins (see the examples below). It is fully expected that one of skill can make a variety of related recombinogenic proteins given the disclosed sequences.

Standard molecular biological techniques can be used to make nucleic acids which comprise the given nucleic acids, e.g., by cloning the nucleic acids into any known vector. Examples of appropriate cloning and sequencing techniques, and instructions sufficient to direct persons of skill through many cloning exercises are found in Berger and Kimmel, Guide to Molecular Cloning Techniques, Methods in Enzymology volume 152 Academic Press, Inc., San Diego, Calif. (Berger); Sambrook et al. (1989) Molecular Cloning: A Laboratory Manual (2nd ed.) Vol. 1-3, Cold Spring Harbor Laboratory, Cold Spring Harbor Press, NY, (Sambrook); and Current Protocols in Molecular Biology, F. M. Ausubel et al., eds., Current Protocols, a joint venture between Greene Publishing Associates, Inc. and John Wiley & Sons, Inc., (1994 Supplement) (Ausubel). Product information from manufacturers of biological reagents and experimental equipment also provides information useful in known biological methods. Such manufacturers include the SIGMA chemical company (Saint Louis, Mo.), R&D systems (Minneapolis, Minn.), Pharmacia LKB Biotechnology (Piscataway, N.J.), CLONTECH Laboratories, Inc. (Palo Alto, Calif.), Chem Genes Corp., Aldrich Chemical Company (Milwaukee, Wis.), Glen Research, Inc., GIBCO BRL Life Technologies, Inc. (Gaithersburg, Md.), Fluka Chemica-Biochemika Analytika (Fluka Chemie AG, Buchs, Switzerland), Invitrogen (San Diego, Calif.), and Applied Biosystems (Foster City, Calif.), as well as many other commercial sources known to one of skill.

It will be appreciated that conservative substitutions of the given sequences can be used to produce nucleic acids which encode hyperrecombinogenic clones.

“Conservatively modified variations” of a particular nucleic acid sequence refers to those nucleic acids which encode identical or essentially identical amino acid sequences, or where the nucleic acid does not encode an amino acid sequence, to essentially identical sequences. Because of the degeneracy of the genetic code, a large number of functionally identical nucleic acids encode any given polypeptide. For instance, the codons CGU, CGC, CGA, CGG, AGA, and AGG all encode the amino acid arginine. Thus, at every position where an arginine is specified by a codon, the codon can be altered to any of the corresponding codons described without altering the encoded polypeptide. Such nucleic acid variations are “silent variations,” which are one species of “conservatively modified variations.” Every nucleic acid sequence herein which encodes a polypeptide also describes every possible silent variation. One of skill will recognize that each codon in a nucleic acid (except AUG, which is ordinarily the only codon for methionine) can be modified to yield a functionally identical molecule by standard techniques. Accordingly, each “silent variation” of a nucleic acid which encodes a polypeptide is implicit in any described sequence. Furthermore, one of skill will recognize that individual substitutions, deletions or additions which alter, add or delete a single amino acid or a small percentage of amino acids (typically less than 5%, more typically less than 1%) in an encoded sequence are “conservatively modified variations” where the alterations result in the substitution of an amino acid with a chemically similar amino acid. Conservative substitution tables providing functionally similar amino acids are well known in the art. The following six groups each contain amino acids that are conservative substitutions for one another:
1) Alanine (A), Serine (S), Threonine (T); 2) Aspartic acid (D), Glutamic acid (E); 3) Asparagine (N), Glutamine (Q); 4) Arginine (R), Lysine (K); 5) Isoleucine (I), Leucine (L), Methionine (M), Valine (V); and 6) Phenylalanine (F), Tyrosine (Y), Tryptophan (W). See also, Creighton (1984) Proteins W.H. Freeman and Company. Finally, the addition of sequences which do not alter the encoded activity of a nucleic acid molecule, such as a non-functional sequence, is a conservative modification of the basic nucleic acid.
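The six substitution groups and the arginine codon example above lend themselves to a small lookup, sketched here for illustration. The function names `is_conservative` and `classify_substitutions` are hypothetical, and the sketch compares only equal-length protein sequences; it does not handle the deletions or additions also mentioned in the text.

```python
# The six conservative-substitution groups listed in the text (one-letter codes).
CONSERVATIVE_GROUPS = [
    set("AST"),   # Alanine, Serine, Threonine
    set("DE"),    # Aspartic acid, Glutamic acid
    set("NQ"),    # Asparagine, Glutamine
    set("RK"),    # Arginine, Lysine
    set("ILMV"),  # Isoleucine, Leucine, Methionine, Valine
    set("FYW"),   # Phenylalanine, Tyrosine, Tryptophan
]

# The six arginine codons named in the text; swapping among them is "silent".
ARGININE_CODONS = {"CGU", "CGC", "CGA", "CGG", "AGA", "AGG"}


def is_conservative(a, b):
    """True if residues a and b are identical or fall in the same group."""
    if a == b:
        return True
    return any(a in group and b in group for group in CONSERVATIVE_GROUPS)


def classify_substitutions(reference, variant):
    """Compare two equal-length protein sequences position by position."""
    if len(reference) != len(variant):
        raise ValueError("sequences must have equal length")
    changes = [
        (i, r, v) for i, (r, v) in enumerate(zip(reference, variant)) if r != v
    ]
    conservative = [c for c in changes if is_conservative(c[1], c[2])]
    fraction = len(changes) / len(reference) if reference else 0.0
    return {
        "changes": changes,
        "conservative": conservative,
        "fraction_changed": fraction,
    }
```

A variant would count as conservatively modified in the sense described above when every change is in the `conservative` list and `fraction_changed` stays below the stated thresholds (typically 5%, more typically 1%).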

One of skill will appreciate that many conservative variations of the nucleic acid constructs disclosed yield a functionally identical construct. For example, due to the degeneracy of the genetic code, “silent substitutions” (i.e., substitutions of a nucleic acid sequence which do not result in an alteration in an encoded polypeptide) are an implied feature of every nucleic acid sequence which encodes an amino acid. Similarly, “conservative amino acid substitutions,” in which one or a few amino acids in an amino acid sequence of a packaging or packageable construct are substituted with different amino acids with highly similar properties, are also readily identified as being highly similar to a disclosed construct. Such conservatively substituted variations of each explicitly disclosed sequence are a feature of the present invention.

Nucleic acids which hybridize under stringent conditions to the nucleic acids in the figures are a feature of the invention. “Stringent hybridization wash conditions” in the context of nucleic acid hybridization experiments such as Southern and northern hybridizations are sequence dependent, and are different under different environmental parameters. An extensive guide to the hybridization of nucleic acids is found in Tijssen (1993) Laboratory Techniques in Biochemistry and Molecular Biology-Hybridization with Nucleic Acid Probes part I chapter 2 “overview of principles of hybridization and the strategy of nucleic acid probe assays”, Elsevier, New York. Generally, highly stringent hybridization and wash conditions are selected to be about 5° C. lower than the thermal melting point (Tm) for the specific sequence at a defined ionic strength and pH. The Tm is the temperature (under defined ionic strength and pH) at which 50% of the target sequence hybridizes to a perfectly matched probe. Very stringent conditions are selected to be equal to the Tm for a particular probe. In general, a signal to noise ratio of 2× (or higher) than that observed for an unrelated probe in the particular hybridization assay indicates detection of a specific hybridization.
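The two numeric rules in the preceding paragraph (a wash temperature about 5 degrees below the thermal melting point, and a signal to noise ratio of at least 2×) can be expressed as a short sketch. The function names and default parameters are assumptions for illustration; the melting point itself is taken as a measured or separately estimated input.

```python
def wash_temperature(melting_point_c, very_stringent=False, offset_c=5.0):
    """Return a wash temperature in degrees Celsius.

    Highly stringent: about offset_c below the thermal melting point.
    Very stringent: equal to the thermal melting point.
    """
    return melting_point_c if very_stringent else melting_point_c - offset_c


def is_specific(signal, unrelated_probe_signal, min_ratio=2.0):
    """True if the signal is at least min_ratio times the unrelated-probe signal."""
    if unrelated_probe_signal <= 0:
        raise ValueError("unrelated probe signal must be positive")
    return signal / unrelated_probe_signal >= min_ratio
```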

Nucleic acids which do not hybridize to each other under stringent conditions are still substantially identical if the polypeptides which they encode are substantially identical. This occurs, e.g., when a copy of a nucleic acid is created using the maximum codon degeneracy permitted by the genetic code.

Finally, preferred nucleic acids encode hyper-recombinogenic RecA proteins which are at least one order of magnitude (10 times) as active as a wild-type RecA protein in a standard assay for RecA activity.

4.6.1.2.2.14 RecE/RecT Mediated Stochastic &/or Non-Stochastic Mutagenesis In Vivo

Like recA, recE and recT (or their homologues, for example the lambda recombination proteins red α and red β) can stimulate homologous recombination in vivo. See, Muyrers et al. (1999) Nucleic Acids Res 27(6): 1555-7 and Zhang et al. (1998) Nat Genet (2): 123-8. Hyper-recombinogenic recE and recT are evolved by the same method as described for recA. Alternatively, variants with increased recombinogenicity are selected by their ability to cause recombination between a suicide vector (lacking an origin of replication) carrying a selectable marker, and a homologous region in either the chromosome or a stably-maintained episome.

A plasmid containing recE and recT genes is stochastic &/or non-stochastic mutagenized (either using these genes as single starting points, or by family stochastic &/or non-stochastic mutagenesis with, for example, red α and red β, or other homologous genes identified from available sequence databases). This stochastic &/or non-stochastic mutagenized library is then cloned into a vector with a selectable marker and transformed into an appropriate recombination-deficient strain. The library of cells is then transformed with a second selectable marker, either borne on a suicide vector or as a linear DNA fragment with regions at its ends that are homologous to a target sequence (either in the plasmid or in the host chromosome). Integration of this marker by homologous recombination is a selectable event, dependent on the activity of the recE and recT gene products. The recE/recT genes are isolated from cells in which homologous recombination has occurred. The process is repeated several times to enrich for the most efficient variants before the next round of stochastic &/or non-stochastic mutagenesis is performed. In addition, cycles of recombination without selection can be performed to increase the diversity of a cell population prior to selection.

Once hyper-recombinogenic recE/recT genes are isolated they are used as described for hyper-recombinogenic recA. For example, they are expressed (constitutively or conditionally) in a host cell to facilitate homologous recombination between variant gene fragments and homologues within the host cell. They are alternatively introduced by microinjection, biolistics, lipofection or other means into a host cell at the same time as the variant genes.

Hyper-recombinogenic recE/recT (either of bacterial/phage origin, or from plant homologues) are useful for facilitating homologous recombination in plants. They are, for example, cloned into the Agrobacterium cloning vector, where they are expressed upon entry into the plant, thereby stimulating homologous recombination in the recipient cell.

In a preferred embodiment, recE/recT are used and/or generated in mutS strains.

4.6.1.2.2.15 Multi-Cyclic Recombination

As noted, protoplast fusion is an efficient means of recombining two microbial genomes. The process reproducibly results in about 10% of a non-selected population being recombinant chimeric organisms.

Protoplasts are cells that have been stripped of their cell walls by treatment in hypotonic medium with cell wall degrading enzymes. Protoplast fusion is the induced fusion of the membranes of two or more of these protoplasts by fusogenic agents such as polyethylene glycol. Fusion results in cytoplasmic mixing and places the genomes of the fused cells within the same membrane. Under these conditions recombination between the genomes is frequent.

The fused protoplasts are regenerated, and, during cell division, single genomes segregate into each daughter cell. Typically, 10% of these daughter cells have genomes that originate partially from more than one of the original parental protoplast genomes.

This result is similar to that of the crossing over of sister chromatids in eukaryotic cells during prophase of meiosis II. The percentage of daughter cells that are recombinant is simply lower after protoplast fusion. While protoplast fusion does result in efficient recombination, the recombination predominantly occurs between two cells, as in sexual recombination.

In order to efficiently generate whole genome stochastic &/or non-stochastic mutagenized libraries, daughter cells having genetic information originating from multiple parents are made.

In vitro DNA stochastic &/or non-stochastic mutagenesis results in the efficient poolwise recombination of multiple homologous DNA sequences. The stochastic &/or non-stochastic mutagenesis of full length genes from a mixed pool of small gene fragments requires multiple annealing and elongation cycles, i.e., the thermal cycles of the primerless PCR reaction. During each thermal cycle, many pairs of fragments anneal and are extended to form a combinatorial population of larger chimeric DNA fragments. After the first cycle of stochastic &/or non-stochastic mutagenesis, chimeric fragments contain sequences originating from two different parent genes. This is similar to the result of a single sexual cycle within a population, pairwise cross, or protoplast fusion. During the second cycle, these chimeric fragments can anneal with each other, or with other small fragments, resulting in chimeras originating from up to four different parental sequences.

This second cycle is analogous to the entire progeny from a single sexual cross inbreeding with itself. Further cycles will result in chimeras originating from 8, 16, 32, etc. parental sequences and are analogous to further inbreedings of the progeny population. The power of in vitro DNA stochastic &/or non-stochastic mutagenesis is that a large combinatorial library can be generated from a single pool of DNA fragments stochastic &/or non-stochastic mutagenized by these recursive pairwise “matings.” As described above, in vivo stochastic &/or non-stochastic mutagenesis strategies, such as protoplast fusion, result in a single pairwise mating reaction. Thus, to generate the level of diversity obtained by in vitro methods, in vivo methods are carried out recursively. That is, a pool of organisms is recombined and the progeny pooled, without selection, and then recombined again. This process is repeated for sufficient cycles to result in progeny having multiple parental sequences.
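The doubling described above (2, 4, 8, 16, 32 parental sequences after successive pairwise cycles) is an upper bound that can be sketched as follows. The function names are hypothetical, and the bound says nothing about how frequently such maximally mixed progeny actually arise.

```python
def max_parents(cycles, pool_size=None):
    """Upper bound on parental sequences represented in one chimera.

    Each pairwise cycle can at most double the number of contributing
    parents, and the bound cannot exceed the number of parents in the pool.
    """
    if cycles < 0:
        raise ValueError("cycles must be non-negative")
    bound = 2 ** cycles
    return bound if pool_size is None else min(bound, pool_size)


def min_cycles(parents):
    """Smallest number of pairwise cycles after which a single progeny
    could carry sequence from all of the given parents."""
    if parents < 1:
        raise ValueError("need at least one parent")
    cycles = 0
    while 2 ** cycles < parents:
        cycles += 1
    return cycles


if __name__ == "__main__":
    for n in range(1, 6):
        print(n, max_parents(n))
```

For four parents the bound gives a minimum of two cycles; in practice more rounds may be needed before four-parent progeny become a measurable fraction of the population.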

Described below is a method used to reassemble four strains of Streptomyces coelicolor. From the initial four strains, each containing a unique nutritional marker, three to four rounds of recursive pooled protoplast fusion were sufficient to generate a population of stochastic &/or non-stochastic mutagenized organisms containing all 16 possible combinations of the four markers. This represents a 10⁶ fold improvement in the generation of four parent progeny as compared to a single pooled fusion of the four strains.

Protoplasts were generated from several strains of S. coelicolor, pooled and fused. Mycelia were regenerated and allowed to sporulate. The spores were collected, allowed to grow into mycelia, formed into protoplasts, pooled and fused, and the process was repeated for three to four rounds. The resulting spores were then subjected to screening.

The basic protocol for generating a whole genome stochastic &/or non-stochastic mutagenized library from four S. coelicolor strains, each having one of four distinct markers, was as follows. Four mycelial cultures, each of a strain having one of four different markers, were grown to early stationary phase. The mycelia from each were harvested by centrifugation and washed. Protoplasts from each culture were prepared as follows. Approximately 10⁹ S. coelicolor spores were inoculated into 50 ml YEME with 0.5% glycine in a 250 ml baffled flask. The spores were incubated at 30° C. for 36-40 hours in an orbital shaker. Mycelial growth was verified using a microscope. Some strains needed an additional day of growth. The culture was transferred into a 50 ml tube and centrifuged at 4,000 rpm for 10 min. The mycelium was washed twice with 10.3% sucrose and centrifuged at 4,000 rpm for 10 min (mycelium can be stored at about −80° C. after the wash). 5 ml of lysozyme was added to about 0.5 g of mycelium pellet. The pellet was suspended and incubated at 30° C. for 20-60 min, with gentle shaking every 10 min. Protoplasting was checked under the microscope every 20 min. Once the majority were protoplasts, protoplasting was stopped by adding 10 ml of P buffer. The protoplasts were filtered through cotton and spun down at 3,000 rpm for 7 min at room temperature. The supernatant was discarded and the protoplasts gently resuspended, adding a suitable amount of P buffer according to the pellet size (usually about 500 μl). Ten-fold serial dilutions were made in P buffer, and the protoplasts counted at a 10⁻² dilution. Protoplasts were adjusted to 10¹⁰ protoplasts per ml.
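The counting and adjustment at the end of this protocol amount to a dilution calculation, sketched below for illustration. The function names and the example numbers are assumptions; the sketch assumes the count is expressed as protoplasts per ml of the diluted sample.

```python
def stock_concentration(counted_per_ml, fold_dilution):
    """Protoplasts per ml in the undiluted stock.

    fold_dilution is the total dilution factor of the counted sample,
    e.g. 100 for two ten-fold serial dilutions.
    """
    if fold_dilution < 1:
        raise ValueError("fold_dilution must be at least 1")
    return counted_per_ml * fold_dilution


def buffer_to_add(stock_per_ml, stock_volume_ml, target_per_ml):
    """Volume of buffer (ml) needed to dilute the stock to the target density."""
    if target_per_ml <= 0:
        raise ValueError("target must be positive")
    if stock_per_ml < target_per_ml:
        raise ValueError("stock is below target; concentrate by centrifugation")
    final_volume = stock_per_ml * stock_volume_ml / target_per_ml
    return final_volume - stock_volume_ml


if __name__ == "__main__":
    # Illustrative numbers only: 2e8 per ml counted in a 100-fold dilution.
    stock = stock_concentration(2e8, 100)
    print(stock, buffer_to_add(stock, 0.5, 1e10))
```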

The protoplasts from each culture were quantitated by microscopy. 10⁸ protoplasts from each culture were mixed in the same tube, washed, and then fused by the addition of 50% PEG. The fused protoplasts were diluted and plated on regeneration medium and incubated until the colonies were sporulating (four days). Spores were harvested and washed. These spores represent a pool of all the recombinants and parents from the fusion.

A sample of the pooled spores was then used to inoculate a single liquid culture. The culture was grown to early stationary phase, the mycelia harvested, and protoplasts prepared. 10⁸ protoplasts from this “mycelial library” were then fused with themselves by the addition of 50% PEG. The protoplast fusion/regeneration/harvesting/protoplast preparation steps were repeated two times. The spores resulting from the fourth round of fusion were considered the “whole genome stochastic &/or non-stochastic mutagenized library” and they were screened for the frequency of the 16 possible combinations of the four markers.

In particular, adding rounds of recombination prior to selection produced significant increases in the number of clones which incorporated all four of the relevant selectable markers, indicating that the population became increasingly diverse by recursive pooling and sporulation.

The four strains of the four parent stochastic &/or non-stochastic mutagenesis were each auxotrophic for three and prototrophic for one of four possible nutritional markers: arginine (A), cystine (C), proline (P), and/or uracil (U). Spores from each fusion were plated on each of the 16 possible combinations of these four nutrients, and the percent of the population growing on a particular medium was calculated as the ratio of the colonies from a selective plate to those growing on a plate having all four nutrients (all variants grow on the medium having all four nutrients, so the colonies from this plate represent the total viable population). The corrected percentages for each of the no, one, two, and three marker phenotypes were determined by subtracting the percentage of cells having additional markers that might grow on the medium having “unnecessary” nutrients. For example, the number of colonies growing on no additional nutrients (the prototroph) was subtracted from the number of colonies growing on any plate requiring nutrients.
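One way to read the plating arithmetic above is sketched below. The sketch enumerates the 16 media, computes the percent growing on each relative to the all-nutrient plate, and then corrects each medium by subtracting every phenotype that needs strictly fewer nutrients. This subset-subtraction is an interpretation of the correction described in the text, and the function names are hypothetical.

```python
from itertools import combinations

NUTRIENTS = ("A", "C", "P", "U")  # arginine, cystine, proline, uracil


def all_media():
    """All 16 subsets of the four nutrients, from none to all four."""
    return [
        frozenset(combo)
        for size in range(len(NUTRIENTS) + 1)
        for combo in combinations(NUTRIENTS, size)
    ]


def percent_growing(colonies):
    """colonies maps each medium (frozenset of nutrients) to a colony count.

    The plate with all four nutrients is taken as the total viable population.
    """
    total = colonies[frozenset(NUTRIENTS)]
    return {medium: 100.0 * count / total for medium, count in colonies.items()}


def corrected_percent(raw):
    """Percent of cells requiring exactly the nutrients in each medium.

    A cell grows on a medium if its requirements are a subset of the
    nutrients supplied, so phenotypes with fewer requirements are subtracted.
    """
    exact = {}
    for medium in sorted(raw, key=len):  # smallest media first
        already_counted = sum(v for s, v in exact.items() if s < medium)
        exact[medium] = raw[medium] - already_counted
    return exact
```

Under this reading, the corrected value for the medium lacking all four nutrients is the prototroph frequency, and the corrected value for the all-nutrient medium is the fraction of cells still requiring all four nutrients.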

4.6.1.2.2.16 Whole Genome Stochastic &/or Non-Stochastic Mutagenesis Through Organized Heteroduplex Stochastic &/or Non-Stochastic Mutagenesis

A new procedure to optimize phenotypes of interest by heteroduplex stochastic &/or non-stochastic mutagenesis of cosmid libraries of the organism of choice is provided. This procedure does not require protoplast fusion and is applicable to bacteria for which well-established genetic systems are available, including cosmid cloning, transformation, in vitro packaging/transfection and plasmid transfer/mobilization. Microorganisms that can be improved by these methods include Escherichia coli, Pseudomonas aeruginosa, Pseudomonas putida, Pseudomonas spp., Rhizobium spp., Xanthomonas spp., and other gram-negative organisms. This method is also applicable to Gram-positive microorganisms.

In step A, chromosomal DNA of the organism to be improved is digested with suitable restriction enzymes and ligated into a cosmid. The cosmid used for cosmid-based heteroduplex guided whole genome stochastic &/or non-stochastic mutagenesis has at least two rare restriction enzyme recognition sites (e.g., SfiI and NotI) to be used for linearization in subsequent steps. Sufficient cosmids to represent the complete chromosome are purified and stored in 96-well microtiter dishes. In step B, small samples of the library are mutagenized in vitro using hydroxylamine or other mutagenic chemicals. In step C, a sample from each well of the mutagenized collection is used to transfect the target cells. In step D, the transfectants are assayed (as a pool from each mutagenized sample-well) for phenotypic improvements. Positives from this assay indicate that a cosmid from a particular well can confer phenotypic improvements and thus contains large genomic fragments that are suitable targets for heteroduplex mediated stochastic &/or non-stochastic mutagenesis. In step E, the transfected cells harboring a mutant library of the identified cosmid(s) are separated by plating on solid media and screened for independent mutants conferring an improved phenotype. In step F, DNA from positive cells is isolated and pooled by origin. In step G, the selected cosmid pools are divided so that one sample can be digested with SfiI and the other with NotI. These samples are pooled, denatured, reannealed, and religated.

In step H, target cells are transfected with the resulting heteroduplexes and propagated to allow “recombination” to occur between the strands of the heteroduplexes in vivo. The transfectants can be screened (the population will represent the pairwise recombinants) or, commonly, as represented by step I, the recombined cosmids are further stochastic &/or non-stochastic mutagenized by recursive in vitro heteroduplex formation and in vivo recombination (to generate a complete combinatorial library of the possible mutations) prior to screening. An additional mutagenesis step could also be added for increased diversity during the stochastic &/or non-stochastic mutagenesis process.

In step J, once several cosmids harboring different distributed loci have been improved, they are combined into the same host by chromosome integration. This organism can be used directly or subjected to a new round of heteroduplex guided whole genome stochastic &/or non-stochastic mutagenesis.
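Step A calls for sufficient cosmids to represent the complete chromosome. The text does not say how that number is chosen; one common estimate, used here purely as an assumption, is the Clarke and Carbon formula N = ln(1 − P) / ln(1 − f), where f is the insert size divided by the genome size and P is the desired probability that any given sequence is present. The function names and the example sizes are illustrative.

```python
import math


def clones_needed(genome_bp, insert_bp, probability=0.99):
    """Clarke-Carbon estimate of the number of independent clones required.

    This formula is an assumption of the sketch, not a step stated in the text.
    """
    if not 0 < insert_bp < genome_bp:
        raise ValueError("insert must be smaller than the genome")
    if not 0 < probability < 1:
        raise ValueError("probability must be between 0 and 1")
    fraction = insert_bp / genome_bp
    return math.ceil(math.log(1.0 - probability) / math.log(1.0 - fraction))


def plates_needed(clones, wells_per_plate=96):
    """Number of microtiter plates needed to hold one clone per well."""
    return math.ceil(clones / wells_per_plate)


if __name__ == "__main__":
    # Illustrative sizes: a 4.6 Mb bacterial chromosome and 40 kb cosmid inserts.
    n = clones_needed(4_600_000, 40_000, probability=0.99)
    print(n, plates_needed(n))
```

A higher target probability or smaller inserts increase the number of wells that must be carried through steps B to D.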

4.7. Specialized Methods

4.7.1 Targeted Stochastic &/or Non-Stochastic Mutagenesis-Hot Spots

In one aspect, targeted homologous genes are cloned into specific regions of the genome (e.g., by homologous recombination or other targeting procedures) which are known to be recombination “hot spots” (i.e., regions showing elevated levels of recombination compared to the average level of recombination observed across an entire genome), or known to be proximal to such hot spots. The resulting recombinant strains are mated recursively. During meiotic recombination, homologous recombinant genes recombine, thereby increasing the diversity of the genes. After several cycles of recombination by recursive mating, the resulting cells are screened.

4.7.2 Stochastic &/or Non-Stochastic Mutagenesis Using Yeasts

Yeasts are subspecies of fungi that grow as single cells. Yeasts are used for the production of fermented beverages and leavening, for production of ethanol as a fuel, low molecular weight compounds, and for the heterologous production of proteins and enzymes (see accompanying list of yeast strains and their uses). Commonly used strains of yeast include Saccharomyces cerevisiae, Pichia sp., Candida sp. and Schizosaccharomyces pombe.

Several types of vectors are available for cloning in yeast including integrative plasmid (YIp), yeast replicating plasmid (YRp, such as the 2μ circle based vectors), yeast episomal plasmid (YEp), yeast centromeric plasmid (YCp), or yeast artificial chromosome (YAC). Each vector can carry markers useful to select for the presence of the plasmid such as LEU2, URA3, and HIS3, or the absence of the plasmid such as URA3 (a gene that is toxic to cells grown in the presence of 5-fluoro orotic acid).

Many yeasts have a sexual cycle and asexual (vegetative) cycles. The sexual cycle involves the recombination of the whole genome of the organism each time the cell passes through meiosis. For example, when diploid cells of S. cerevisiae are exposed to nitrogen and carbon limiting conditions, diploid cells undergo meiosis to form asci. Each ascus holds four haploid spores, two of mating type “a” and two of mating type “α.” Upon return to rich medium, haploid spores of opposite mating type mate to form diploid cells once again. Ascospores of opposite mating type can mate within the ascus, or if the ascus is degraded, for example with zymolyase, the haploid cells are liberated and can mate with spores from other asci. This sexual cycle provides a format to reassemble endogenous genomes of yeast and/or exogenous fragment libraries inserted into yeast vectors. This process results in swapping or accumulation of hybrid genes, and in the stochastic &/or non-stochastic mutagenesis of homologous sequences shared by mating cells.

Yeast strains having mutations in several known genes have properties useful for stochastic &/or non-stochastic mutagenesis. These properties include increasing the frequency of recombination and increasing the frequency of spontaneous mutations within a cell. These properties can be the result of mutation of a coding sequence or altered expression (usually overexpression) of a wildtype coding sequence. The HO nuclease effects the transposition of HMLa/α and HMRa/α to the MAT locus resulting in mating type switching. Mutants in the gene encoding this enzyme do not switch their mating type and can be employed to force crossing between strains of defined genotype, such as ones that harbor a library or have a desired phenotype, and to prevent inbreeding of starter strains. PMS1, MLH1, MSH2, and MSH6 are involved in mismatch repair. Mutations in these genes all have a mutator phenotype (Chambers et al., Mol. Cell. Biol. 16, 6110-6120 (1996)). Mutations in TOP3 DNA topoisomerase have a 6-fold enhancement of interchromosomal homologous recombination (Bailis et al., Molecular and Cellular Biology 12, 4988-4993 (1992)). The RAD50-57 genes confer resistance to radiation. Rad3 functions in excision of pyrimidine dimers. RAD52 functions in gene conversion. RAD50, MRE11, and XRS2 function in both homologous recombination and illegitimate recombination. HOP1 and RED1 function in early meiotic recombination (Mao-Draayer, Genetics 144, 71-86). Mutations in either HOP1 or RED1 reduce double stranded breaks at the HIS2 recombination hotspot. Strains deficient in these genes are useful for maintaining stability in hyper recombinogenic constructs such as tandem expression libraries carried on YACs. Mutations in HPR1 are hyperrecombinogenic. HDF1 has DNA end binding activity and is involved in double stranded break repair and V(D)J recombination.

Strains bearing this mutation are useful for transformation with random genomic fragments by either protoplast fusion or electroporation. Kar-1 is a dominant mutation that prevents karyogamy. Kar-1 mutants are useful for the directed transfer of single chromosomes from a donor to a recipient strain. This technique has been widely used in the transfer of YACs between strains, and is also useful in the transfer of evolved genes/chromosomes to other organisms (Markie, YAC Protocols, (Humana Press, Totowa, N.J., 1996)). HOT1 is an S. cerevisiae recombination hotspot within the promoter and enhancer region of the rDNA repeat sequences. This locus induces mitotic recombination at adjacent sequences, presumably due to its high level of transcription. Genes and/or pathways inserted under the transcriptional control of this region undergo increased mitotic recombination. The regions surrounding the ARG4 and HIS4 genes are also recombination hot spots, and genes cloned in these regions have an increased probability of undergoing recombination during meiosis.

Homologous genes can be cloned in these regions and subjected to stochastic &/or non-stochastic mutagenesis in vivo by recursively mating the recombinant strains. CDC2 encodes polymerase δ and is necessary for mitotic gene conversion. Overexpression of this gene can be used in a reassembler or mutator strain. A temperature sensitive mutation in CDC4 halts the cell cycle at G1 at the restrictive temperature and could be used to synchronize protoplasts for optimized fusion and subsequent recombination.

As with filamentous fungi, the general goals of stochastic &/or non-stochastic mutagenesis of yeast include improvement of yeast as a host organism for genetic manipulation, and as a production apparatus for various compounds. One desired property in either case is an improved capacity of yeast to express and secrete a heterologous protein. The following example describes the use of stochastic &/or non-stochastic mutagenesis to evolve yeast to express and secrete increased amounts of RNase A.

RNase A catalyzes the cleavage of the P-O5′ bond of RNA specifically after pyrimidine nucleotides. The enzyme is a basic 124 amino acid polypeptide that has 8 half cystine residues, each required for catalysis. YEpWL-RNase A is a vector that effects the expression and secretion of RNase A from the yeast S. cerevisiae, and yeast harboring this vector secrete 1-2 mg of recombinant RNase A per liter of culture medium (del Cardayre et al., Protein Engineering 8(3): 261-273 (1995)). This overall yield is poor for a protein heterologously expressed in yeast and can be improved at least 10-100 fold by stochastic &/or non-stochastic mutagenesis. The expression of RNase A is easily detected by several plate and microtitre plate assays (del Cardayre & Raines, Biochemistry 33, 6031-6037 (1994)). Each of the described formats for whole genome stochastic &/or non-stochastic mutagenesis can be used to reassemble a strain of S. cerevisiae harboring YEpWL-RNase A, and the resulting cells can be screened for increased secretion of RNase A into the medium. The new strains are cycled recursively through the stochastic &/or non-stochastic mutagenesis format until sufficiently high levels of RNase A secretion are observed. RNase A is a particularly useful target since it requires not only proper folding and disulfide bond formation but also proper glycosylation. Thus numerous components of the expression, folding, and secretion systems can be optimized. The resulting strain is also evolved for improved secretion of other heterologous proteins.

4.7.3 Reassemble to Increase Tolerance of Yeast to Ethanol

Another goal of stochastic &/or non-stochastic mutagenesis of yeast is to increase the tolerance of yeast to ethanol. This is useful both for the commercial production of ethanol, and for the production of more alcoholic beers and wines. The yeast strain to be subjected to stochastic &/or non-stochastic mutagenesis acquires genetic material by exchange or transformation with other strain(s) of yeast, which may or may not be known to have superior resistance to ethanol. The strain to be evolved is subjected to stochastic &/or non-stochastic mutagenesis, and shufflants are selected for capacity to survive exposure to ethanol. Increasing concentrations of ethanol can be used in successive rounds of stochastic &/or non-stochastic mutagenesis. The same principles can be used to reassemble baking yeasts for improved osmotolerance.
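The escalating selection described above can be sketched as a simple loop. This is only an illustration under stated assumptions, not the procedure of the invention: the function names (`ethanol_schedule`, `run_rounds`), the ethanol concentrations, and the toy model that reduces each strain to a single tolerance number are all hypothetical and do not come from the text.

```python
def ethanol_schedule(start_pct, step_pct, rounds):
    """Hypothetical helper: ethanol concentration (% v/v) for each round."""
    return [round(start_pct + i * step_pct, 2) for i in range(rounds)]


def run_rounds(population, schedule, mutagenize, survives):
    """Diversify, then keep only survivors, once per ethanol concentration."""
    for pct in schedule:
        candidates = mutagenize(population)
        population = [s for s in candidates if survives(s, pct)]
        if not population:  # nothing survived; stop rather than continue empty
            break
    return population


# Toy model (assumption): a strain is just its tolerance limit in % ethanol.
def toy_mutagenize(pop):
    # Each parent is kept and also yields one variant tolerating 2 points more.
    return pop + [t + 2.0 for t in pop]


def toy_survives(tolerance, pct):
    return tolerance >= pct


schedule = ethanol_schedule(8.0, 2.0, 3)
final = run_rounds([8.0], schedule, toy_mutagenize, toy_survives)
```

In practice `mutagenize` and `survives` would stand for wet-lab steps (genetic exchange or transformation, then an ethanol challenge); the loop only makes explicit that the selection threshold rises from round to round.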

4.7.4 Capacity to Grow under Desired Nutritional Conditions

Another desired property of yeast subjected to stochastic &/or non-stochastic mutagenesis is capacity to grow under desired nutritional conditions. For example, it is useful for yeast to grow on cheap carbon sources such as methanol, starch, molasses, cellulose, cellobiose, or xylose, depending on availability. The principles of stochastic &/or non-stochastic mutagenesis and selection are similar to those discussed for filamentous fungi.

4.7.5 To Produce Secondary Metabolites

Another desired property is capacity to produce secondary metabolites naturally produced by filamentous fungi or bacteria. Examples of such secondary metabolites are cyclosporin A, taxol, and cephalosporins. The yeast to be evolved undergoes genetic exchange or is transformed with DNA from organism(s) that produce the secondary metabolite. For example, fungi producing taxol include Taxomyces andreanae and Pestalotiopsis microspora (Stierle et al., Science 260, 214-216 (1993); Strobel et al., Microbiol. 142, 435-440 (1996)). DNA can also be obtained from trees that naturally produce taxol, such as Taxus brevifolia. DNA encoding one enzyme in the taxol pathway, taxadiene synthase, which is believed to catalyze the committed step in taxol biosynthesis and may be rate limiting in overall taxol production, has been cloned (Wildung & Croteau, J. Biol. Chem. 271, 9201-4 (1996)). The DNA is then subjected to stochastic &/or non-stochastic mutagenesis, and shufflants are screened/selected for production of the secondary metabolite. For example, taxol production can be monitored using antibodies to taxol, by mass spectroscopy, or by UV spectrophotometry. Alternatively, production of intermediates in taxol synthesis or enzymes in the taxol synthetic pathway can be monitored (Concetti & Ripani, Biol. Chem. Hoppe Seyler 375, 419-23 (1994)). Other examples of secondary metabolites are polyols, amino acids, polyketides, non-ribosomal polypeptides, ergosterol, carotenoids, terpenoids, sterols, vitamin E, and the like.

4.7.6 Increase Ability to Separate in Ethanol

Another desired property is increased flocculence of yeast, to facilitate separation in the preparation of ethanol. Yeast can be subjected to stochastic &/or non-stochastic mutagenesis by any of the procedures noted above, with selection for the resulting yeast forming the largest clumps.

4.7.6.1 Exemplary Procedure for Yeast Protoplasting

Protoplast preparation in yeast is reviewed by Morgan, in Protoplasts (Birkhauser Verlag, Basel, 1983). Fresh cells (~10⁸) are washed with buffer, for example 0.1 M potassium phosphate, then resuspended in this same buffer containing a reducing agent, such as 50 mM DTT, incubated for 1 h at 30° C. with gentle agitation, and then washed again with buffer to remove the reducing agent. These cells are then resuspended in buffer containing a cell wall degrading enzyme, such as Novozyme 234 (1 mg/mL), and any of a variety of osmotic stabilizers, such as sucrose, sorbitol, NaCl, KCl, MgSO₄, MgCl₂, or NH₄Cl at any of a variety of concentrations. These suspensions are then incubated at 30° C. with gentle shaking (~60 rpm) until protoplasts are released. Several strategies are possible for generating protoplasts that are more likely to produce productive fusants.

Protoplast formation can be increased if the cell cycle of the cells has been synchronized to be halted at G1. In the case of S. cerevisiae this can be accomplished by the addition of mating factors, either a or alpha (Curran & Carter, J. Gen. Microbiol. 129, 1589-1591 (1983)). These peptides act as adenylate cyclase inhibitors which, by decreasing the cellular level of cAMP, arrest the cell cycle at G1. In addition, sex factors have been shown to induce the weakening of the cell wall in preparation for the sexual fusion of a and alpha cells (Crandall & Brock, Bacteriol. Rev. 32, 139-163 (1968); Osumi et al., Arch. Microbiol. 97, 27-38 (1974)). Thus in the preparation of protoplasts, cells can be treated with mating factors or other known inhibitors of adenylate cyclase, such as leflunomide or the killer toxin from K. lactis, to arrest them at G1 (Sugisaki et al., Nature 304, 464-466 (1983)). Then, after fusion of the protoplasts (step 2), cAMP can be added to the regeneration medium to induce S-phase and DNA synthesis. Alternatively, yeast strains having a temperature sensitive mutation in the CDC4 gene can be used, such that cells can be synchronized and arrested at G1. After fusion, cells are returned to the permissive temperature so that DNA synthesis and growth resume.

Once suitable protoplasts have been prepared, it is necessary to induce fusion by physical or chemical means. An equal number of protoplasts of each cell type is mixed in phosphate buffer (0.2 M, pH 5.8, 2×10⁸ cells/mL) containing an osmotic stabilizer, for example 0.8 M NaCl, and PEG 6000 (33% w/v), and then incubated at 30° C. for 5 min while fusion occurs. Polyols, or other compounds that bind water, can be employed. The fusants are then washed and resuspended in the osmotically stabilized buffer lacking PEG, and transferred to osmotically stabilized regeneration medium on/in which the cells can be selected or screened for a desired property.
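As a worked illustration of the quantities in the fusion step, the following sketch computes the protoplasts per cell type and the mass of PEG 6000 and NaCl for a chosen volume. It is a convenience calculation only. The function name `fusion_mix` is hypothetical, and it assumes that 2×10⁸ cells/mL is the total protoplast density of the mixture (the text does not say whether the figure is total or per cell type) and that two cell types are fused.

```python
NACL_G_PER_MOL = 58.44  # molar mass of NaCl in g/mol


def fusion_mix(volume_ml, n_types=2, cells_per_ml=2e8,
               peg_pct_wv=33.0, nacl_molar=0.8):
    """Hypothetical helper: amounts needed for a fusion mix of volume_ml.

    Assumes cells_per_ml is the TOTAL protoplast density of the mixture.
    """
    total_cells = cells_per_ml * volume_ml
    return {
        # equal numbers of each cell type, as the text requires
        "cells_per_type": total_cells / n_types,
        # % w/v means grams per 100 mL
        "peg_g": peg_pct_wv / 100.0 * volume_ml,
        # mol/L * g/mol * L
        "nacl_g": nacl_molar * NACL_G_PER_MOL * volume_ml / 1000.0,
    }


mix = fusion_mix(10.0)
```

For a 10 mL fusion this gives 1×10⁹ protoplasts of each type, 3.3 g of PEG 6000, and about 0.47 g of NaCl.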

4.7.7 Stochastic &/or Non-Stochastic Mutagenesis Using Artificial Chromosomes

Yeast artificial chromosomes (YACs) are yeast vectors into which very large DNA fragments (e.g., 50-2000 kb) can be cloned (see, e.g., Monaco & Larin, Trends Biotech. 12(7), 280-286 (1994); Ramsay, Mol. Biotechnol. 1(2), 181-201 (1994); Huxley, Genet. Eng. 16, 65-91 (1994); Jakobovits, Curr. Biol. 4(8), 761-3 (1994); Lamb & Gearhart, Curr. Opin. Genet. Dev. 5(3), 342-8 (1995); Montoliu et al., Reprod. Fertil. Dev. 6, 577-84 (1994)). These vectors have telomeres (Tel), a centromere (Cen), and an autonomously replicating sequence (ARS), and can have genes for positive (e.g., TRP1) and negative (e.g., URA3) selection. YACs are maintained, replicated, and segregated like other yeast chromosomes through both meiosis and mitosis, thereby providing a means to expose cloned DNA to true meiotic recombination.

YACs provide a vehicle for the stochastic &/or non-stochastic mutagenesis of libraries of large DNA fragments in vivo. The substrates for stochastic &/or non-stochastic mutagenesis are typically large fragments from 20 kb to 2 Mb. The fragments can be random fragments or can be fragments known to encode a desirable property. For example, a fragment might include an operon of genes involved in production of antibiotics. Libraries can also include whole genomes or chromosomes. Viral genomes and some bacterial genomes can be cloned intact into a single YAC. In some libraries, fragments are obtained from a single organism. Other libraries include fragment variants, as where some libraries are obtained from different individuals or species. Fragment variants can also be generated by induced mutation. Typically, genes within fragments are expressed from naturally associated regulatory sequences within yeast. Alternatively, individual genes can be linked to yeast regulatory elements to form an expression cassette, and a concatemer of such cassettes, each containing a different gene, can be inserted into a YAC.

In some instances, fragments are incorporated into the yeast genome, and stochastic &/or non-stochastic mutagenesis is used to evolve improved yeast strains. In other instances, fragments remain as components of YACs throughout the stochastic &/or non-stochastic mutagenesis process, and after acquisition of a desired property, the YACs are transferred to a desired recipient cell.

4.7.8 Stochastic &/or Non-Stochastic Mutagenesis of Genes for Bioremediation

Modern industry generates many pollutants for which the environment can no longer be considered an infinite sink. Naturally occurring microorganisms are able to metabolize thousands of organic compounds, including many not found in nature (e.g., xenobiotics). Bioremediation, the deliberate use of microorganisms for the biodegradation of man-made wastes, is an emerging technology that offers cost and practicality advantages over traditional methods of disposal. The success of bioremediation depends on the availability of organisms that are able to detoxify or mineralize pollutants.

Microorganisms capable of degrading specific pollutants can be generated by genetic engineering and recursive sequence recombination. Although bioremediation is an aspect of pollution control, a more useful approach in the long term is prevention, before industrial waste is pumped into the environment. Exposure of industrial waste streams to recursive sequence recombination-generated microorganisms capable of degrading the pollutants they contain would result in detoxification or mineralization of these pollutants before the waste stream enters the environment. Issues of releasing recombinant organisms can be avoided by containing them within bioreactors fitted to the industrial effluent pipes. This approach would also allow the microbial mixture used to be adjusted to best degrade the particular wastes being produced. Finally, this method would avoid the problems of adapting to the outside world and dealing with competition that face many laboratory microorganisms.

In the wild, microorganisms have evolved new catabolic activities enabling them to exploit pollutants as nutrient sources for which there is no competition. However, pollutants that are present at low concentrations in the environment may not provide a sufficient advantage to stimulate the evolution of catabolic enzymes. For a review of such naturally occurring evolution of biodegradative pathways and the manipulation of some of these microorganisms by classical techniques, see Ramos et al., Bio/Technology 12: 1349-1355 (1994).

Generation of new catabolic enzymes or pathways for bioremediation has thus relied upon deliberate transfer of specific genes between organisms (Wackett et al., supra), forced matings between bacteria with specific catabolic capabilities (Brenner et al. Biodegradation 5: 359-377 (1994)), or prolonged selection in a chemostat. Some researchers have attempted to facilitate evolution via naturally occurring genetic mechanisms in their chemostat selections by including microorganisms with a variety of catabolic pathways (Kellogg et al. Science 214: 1133-1135 (1981); Chakrabarty American Society of Micro. Biol. News 62: 130-137 (1996)). For a review of efforts in this area, see Cameron et al. Applied Biochem. Biotech. 38: 105-140 (1993).

Current efforts in improving organisms for bioremediation take a labor-intensive approach in which many parameters are optimized independently, including transcription efficiency from native and heterologous promoters, regulatory circuits, and translational efficiency, as well as improvement of protein stability and activity (Timmis et al. Ann. Rev. Microbiol. 48: 525-557 (1994)).

A recursive sequence recombination approach overcomes a number of limitations in the bioremediation capabilities of naturally occurring microorganisms. Both enzyme activity and specificity can be altered, simultaneously or sequentially, by the methods of the invention. For example, catabolic enzymes can be evolved to increase the rate at which they act on a substrate. Although knowledge of a rate-limiting step in a metabolic pathway is not required to practice the invention, rate-limiting proteins in pathways can be evolved to have increased expression and/or activity, the requirement for inducing substances can be eliminated, and enzymes can be evolved that catalyze novel reactions.

Some examples of chemical targets for bioremediation include but are not limited to benzene, xylene, toluene, camphor, naphthalene, halogenated hydrocarbons, polychlorinated biphenyls (PCBs), trichloroethylene, pesticides such as pentachlorophenyls (PCPs), and herbicides such as atrazine.

4.7.8.1 Aromatic Hydrocarbons

Preferably, when an enzyme is “evolved” to have a new catalytic function, that function is expressed, either constitutively or in response to the new substrate. Recursive sequence recombination subjects both structural and regulatory elements (including the structure of regulatory proteins) of a protein to recombinogenic mutagenesis simultaneously. Selection of mutants that are efficiently able to use the new substrate as a nutrient source will be sufficient to ensure that both the enzyme and its regulation are optimized, without detailed analysis of either protein structure or operon regulation.

Examples of aromatic hydrocarbons include but are not limited to benzene, xylene, toluene, biphenyl, and polycyclic aromatic hydrocarbons such as pyrene and naphthalene. These compounds are metabolized via catechol intermediates.

Degradation of catechol by Pseudomonas putida requires induction of the catabolic operon by cis,cis-muconate, which acts on the CatR regulatory protein. The binding site for the CatR protein is G-N₁₁-A, while the optimal sequence for the LysR class of activators (of which CatR is a member) is T-N₁₁-A. Mutation of the G to a T in the CatR binding site enhances the expression of catechol metabolizing genes (Chakrabarty, American Society of Microbiology News 62: 130-137 (1996)). This demonstrates that the control of existing catabolic pathways is not optimized for the metabolism of specific xenobiotics. It is also an example of a type of mutant that would be expected from recursive sequence recombination of the operon followed by selection of bacteria that are better able to degrade the target compound.
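The G-N₁₁-A and T-N₁₁-A patterns above lend themselves to a simple sequence scan. The following is a minimal sketch, assuming only the two 13-nucleotide patterns stated in the text. The function names (`find_sites`, `g_to_t`) and the example sequence are hypothetical placeholders and do not represent the actual CatR operator sequence.

```python
import re

# Zero-width lookahead so that overlapping 13-nt windows are all reported.
# Group 1 captures the first base: G (CatR-type site) or T (LysR-class optimum).
SITE = re.compile(r"(?=([GT])[ACGT]{11}A)")


def find_sites(seq):
    """Return (position, pattern) for every G-N11-A or T-N11-A window."""
    hits = []
    for m in SITE.finditer(seq.upper()):
        kind = "T-N11-A" if m.group(1) == "T" else "G-N11-A"
        hits.append((m.start(), kind))
    return hits


def g_to_t(seq, pos):
    """Return seq with the G at index pos replaced by T (the point mutation)."""
    if seq[pos].upper() != "G":
        raise ValueError("expected G at position %d" % pos)
    return seq[:pos] + "T" + seq[pos + 1:]


# Placeholder sequence (NOT the real binding site): G, eleven C's, then A.
example = "CC" + "G" + "C" * 11 + "A" + "CC"
before = find_sites(example)
after = find_sites(g_to_t(example, 2))
```

Because the pattern is so short and permissive, a scan of a real genome would return many chance matches; a scan like this is only useful on a defined promoter region.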

As an example of starting materials, dioxygenases are required for many pathways in which aromatic compounds are catabolized. Even small differences in dioxygenase sequence can lead to significant differences in substrate specificity (Furukawa et al. J. Bact. 175: 5224-5232 (1993); Erickson et al. App. Environ. Micro. 59: 3858-3862 (1993)). A hybrid enzyme made using sequences derived from two “parental” enzymes may possess catalytic activities that are intermediate between the parents (Erickson, ibid.), or may actually be better than either parent for a specific reaction (Furukawa et al. J. Bact. 176: 2121-2123 (1994)). In one of these cases site directed mutagenesis was used to generate a single polypeptide with hybrid sequence (Erickson, ibid.); in the other, a four subunit enzyme was produced by expressing two subunits from each of two different dioxygenases (Furukawa, ibid.). Thus, sequences from one or more genes encoding dioxygenases can be used in the recursive sequence recombination techniques of the instant invention to generate enzymes with new specificities. In addition, other features of the catabolic pathway can also be evolved using these techniques, simultaneously or sequentially, to optimize the metabolic pathway for an activity of interest.

4.7.8.2 Halogenated Hydrocarbons

Large quantities of halogenated hydrocarbons are produced annually for use as solvents and biocides. In the United States alone, these include over 5 million tons of both 1,2-dichloroethane and vinyl chloride, used in PVC production.

The compounds are largely not biodegradable by processes in single organisms, although in principle haloaromatic catabolic pathways can be constructed by combining genes from different microorganisms. Enzymes can be manipulated to change their substrate specificities. Recursive sequence recombination offers the possibility of tailoring enzyme specificity to new substrates without needing detailed structural analysis of the enzymes.

As an example of possible starting materials for the methods of the instant invention, Wackett et al. (Nature 368: 627-629 (1994)) recently demonstrated, through classical techniques, that a recombinant Pseudomonas strain combining seven genes encoding two multi-component oxygenases can, as a single host, metabolize polyhalogenated compounds by sequential reductive and oxidative techniques to yield non-toxic products. These and/or related materials can be subjected to the techniques discussed above so as to evolve and optimize a biodegradative pathway in a single organism.

Trichloroethylene is a significant groundwater contaminant. It is degraded by microorganisms in a cometabolic way (i.e., no energy or nutrients are derived). The enzyme must be induced by a different compound (e.g., Pseudomonas cepacia uses toluene-4-monooxygenase, which requires induction by toluene, to destroy trichloroethylene). Furthermore, the degradation pathway involves formation of highly reactive epoxides that can inactivate the enzyme (Timmis et al. Ann. Rev. Microbiol. 48: 525-557 (1994)). The recursive sequence recombination techniques of the invention could be used to mutate the enzyme and its regulatory region such that it is produced constitutively and is less susceptible to epoxide inactivation. In some embodiments of the invention, selection of hosts constitutively producing the enzyme and less susceptible to the epoxides can be accomplished by demanding growth in the presence of increasing concentrations of trichloroethylene in the absence of inducing substances.

4.7.8.3 Polychlorinated Biphenyls and Polycyclic Aromatic Hydrocarbons

Polychlorinated biphenyls (PCBs) and polycyclic aromatic hydrocarbons (PAHs) are families of structurally related compounds that are major pollutants at many Superfund sites. Bacteria transformed with plasmids encoding enzymes with broader substrate specificity have been used commercially. In nature, no known pathways have been generated in a single host that degrade the larger PAHs or more heavily chlorinated PCBs. Indeed, the collaboration of anaerobic and aerobic bacteria is often required for complete metabolism.

Thus, likely sources of starting material for recursive sequence recombination include identified genes encoding PAH-degrading catabolic pathways on large (20-100 kb) plasmids (Sanseverino et al. Applied Environ. Micro. 59: 1931-1937 (1993); Simon et al. Gene 127: 31-37 (1993); Zylstra et al. Annals of the NY Acad. Sci. 721: 386-398 (1994)). Biphenyl and PCB-metabolizing enzymes, by contrast, are encoded by chromosomal gene clusters, and in a number of cases have been cloned onto plasmids (Hayase et al. J. Bacteriol. 172: 1160-1164 (1990); Furukawa et al. Gene 98: 21-28 (1992); Hofer et al. Gene 144: 9-16 (1994)). The materials can be subjected to the techniques discussed above so as to evolve a biodegradative pathway in a single organism.

Substrate specificity in the PCB pathway largely results from enzymes involved in initial dioxygenation reactions, and can be significantly altered by mutations in those enzymes (Erickson et al. Applied Environ. Micro. 59: 3858-3862 (1993); Furukawa et al. J. Bact. 175: 5224-5232 (1993)). Mineralization of PAHs and PCBs requires that the downstream pathway be able to metabolize the products of the initial reaction (Brenner et al. Biodegradation 5: 359-377 (1994)). In this case, recursive sequence recombination of the entire pathway, with selection for bacteria able to use the PCB or PAH as the sole carbon source, will allow production of novel PCB and PAH degrading bacteria.

4.7.8.4 Herbicides

A general method for evolving genes for the catabolism of insoluble herbicides is exemplified as follows for atrazine. Atrazine [2-chloro-4-(ethylamino)-6-(isopropylamino)-1,3,5-triazine] is a moderately persistent herbicide which is frequently detected in ground and surface water at concentrations exceeding the 3 ppb health advisory level set by the EPA. Atrazine can be slowly metabolized by a Pseudomonas species (Mandelbaum et al. Appl. Environ. Micro. 61: 1451-1457 (1995)). The enzymes catalyzing the first two steps in atrazine metabolism by Pseudomonas are encoded by genes AtzA and AtzB (de Souza et al. Appl. Environ. Micro. 61: 3373-3378 (1995)). These genes have been cloned in a 6.8 kb fragment into pUC18 (AtzAB-pUC). E. coli carrying this plasmid converts atrazine to much more soluble metabolites. It is thus possible to screen for enzyme activity by growing bacteria on plates containing atrazine. The herbicide forms an opaque precipitate in the plates, but cells containing AtzAB-pUC secrete atrazine degrading enzymes, leading to a clear halo around those cells or colonies. Typically, the size of the halo and the rate of its formation can be used to assess the level of activity, so that picking colonies with the largest halos allows selection of the more active or more highly produced atrazine degrading enzymes.

Thus, the plasmids carrying these genes can be subjected to the recursive sequence recombination formats described above to optimize the catabolism of atrazine in E. coli or another host of choice, including Pseudomonas. After each round of recombination, screening of host colonies expressing the evolved genes can be done on agar plates containing atrazine to observe halo formation. This is a generally applicable method for screening enzymes that metabolize insoluble compounds (e.g., polycyclic aromatic hydrocarbons) to soluble ones. Additionally, catabolism of atrazine can provide a source of nitrogen for the cell; if no other nitrogen is available, cell growth will be limited by the rate at which the cells can catabolize atrazine. Cells able to utilize atrazine as a nitrogen source can thus be selected from a background of non-utilizers or poor-utilizers.
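The halo screen can be reduced to a ranking step once colony and clearing diameters have been measured. The sketch below is an assumption-laden illustration rather than a described protocol: the names `halo_width` and `pick_top`, the use of halo width (half the difference between clearing and colony diameter) as the score, and the measurements are all hypothetical.

```python
def halo_width(colony_mm, clearing_mm):
    """Width of the cleared ring around a colony, in mm (never negative)."""
    return max(0.0, (clearing_mm - colony_mm) / 2.0)


def pick_top(colonies, n):
    """Return the ids of the n colonies with the widest halos.

    colonies: iterable of (colony_id, colony_diameter_mm, clearing_diameter_mm)
    """
    ranked = sorted(colonies, key=lambda c: halo_width(c[1], c[2]), reverse=True)
    return [c[0] for c in ranked[:n]]


# Hypothetical plate measurements for three colonies.
plate = [("a", 2.0, 3.0), ("b", 2.0, 6.0), ("c", 3.0, 5.0)]
picked = pick_top(plate, 2)
```

Subtracting the colony diameter keeps large but inactive colonies from outranking small, highly active ones; the colonies picked this way would seed the next round of recombination.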

4.7.8.5 Heavy Metal Detoxification

Bacteria are used commercially to detoxify arsenate waste generated by the mining of arsenopyrite gold ores. As well as mining effluent, industrial waste water is often contaminated with heavy metals (e.g., those used in the manufacture of electronic components and plastics). Thus, simply to be able to perform other bioremedial functions, microorganisms must be resistant to the levels of heavy metals present, including mercury, arsenate, chromate, cadmium, silver, etc.

A strong selective pressure is the ability to metabolize a toxic compound to one less toxic. Heavy metals are toxic largely by virtue of their ability to denature proteins (Ford et al. Bioextraction and Biodeterioration of Metals, p. 1-23). Detoxification of heavy metal contamination can be effected in a number of ways, including changing the solubility or bioavailability of the metal, changing its redox state (e.g., toxic mercuric chloride is detoxified by reduction to the much more volatile elemental mercury), and even bioaccumulation of the metal by immobilized bacteria or plants. The accumulation of metals to a sufficiently high concentration allows metal to be recycled; smelting burns off the organic part of the organism, leaving behind reusable accumulated metal. Resistances to a number of heavy metals (arsenate, cadmium, cobalt, chromium, copper, mercury, nickel, lead, silver, and zinc) are plasmid encoded in a number of species including Staphylococcus and Pseudomonas (Silver et al. Environ. Health Perspect. 102: 107-113 (1994); Ji et al. J. Ind. Micro. 14: 61-75 (1995)). These genes also confer heavy metal resistance on other species (e.g., E. coli). The recursive sequence recombination techniques of the instant invention (RSR) can be used to increase microbial heavy metal tolerances, as well as to increase the extent to which cells will accumulate heavy metals. For example, the ability of E. coli to detoxify arsenate can be improved at least 100-fold by RSR.

Cyanide is very efficiently used to extract gold from rock containing as little as 0.2 oz. per ton. This cyanide can be microbially neutralized and used as a nitrogen source by fungi or bacteria such as Pseudomonas fluorescens. A problem with microbial cyanide degradation is the presence of toxic heavy metals in the leachate. RSR can be used to increase the resistance of bioremedial microorganisms to toxic heavy metals, so that they will be able to survive the levels present in many industrial and Superfund sites. This will allow them to biodegrade organic pollutants including but not limited to aromatic hydrocarbons, halogenated hydrocarbons, and biocides.

4.7.8.6 Microbial Mining

“Bioleaching” is the process by which microbes convert insoluble metal deposits (usually metal sulfides or oxides) into soluble metal sulfates. Bioleaching is commercially important in the mining of arsenopyrite, but has additional potential in the detoxification and recovery of metals and acids from waste dumps. Naturally occurring bacteria capable of bioleaching are reviewed by Rawlings and Silver (Bio/Technology 13: 773-778 (1995)).

These bacteria are typically divided into groups by their preferred temperatures for growth. The more important mesophiles are Thiobacillus and Leptospirillum species. Moderate thermophiles include Sulfobacillus species. Extreme thermophiles include Sulfolobus species. Many of these organisms are difficult to grow in commercial industrial settings, making their catabolic abilities attractive candidates for transfer to and optimization in other organisms such as Pseudomonas, Rhodococcus, T. ferrooxidans, or E. coli. Genetic systems are available for at least one strain of T. ferrooxidans, allowing the manipulation of its genetic material on plasmids.

The recursive sequence recombination methods described above can be used to optimize the catalytic abilities in native hosts or heterologous hosts for evolved bioleaching genes or pathways, such as the ability to convert metals from insoluble to soluble salts. In addition, leach rates of particular ores can be improved as a result of, for example, increased resistance to toxic compounds in the ore concentrate, increased specificity for certain substrates, ability to use different substrates as nutrient sources, and so on.

4.7.8.7 Oil Desulfurization

The presence of sulfur in fossil fuels has been correlated with corrosion of pipelines, pumping, and refining equipment, and with the premature breakdown of combustion engines. Sulfur also poisons many catalysts used in the refining of fossil fuels. The atmospheric emission of sulfur combustion products gives rise to acid rain.

Microbial desulfurization is an appealing bioremediation application. Several bacteria have been reported that are capable of catabolizing dibenzothiophene (DBT), which is the representative compound of the class of sulfur compounds found in fossil fuels. U.S. Pat. No. 5,356,801 discloses the cloning of a DNA molecule from Rhodococcus rhodochrous capable of biocatalyzing the desulfurization of oil. Denome et al. (Gene 175: 6890-6901 (1995)) disclose the cloning of a 9.8 kb DNA fragment from Pseudomonas encoding the upper naphthalene catabolizing pathway, which also degrades dibenzothiophene. Other genes have been identified that perform similar functions (disclosed in U.S. Pat. No. 5,356,801).

The activity of these enzymes is currently too low to be commercially viable, but the pathway could be increased in efficiency using the recursive sequence recombination techniques of the invention. The desired property of the genes of interest is their ability to desulfurize dibenzothiophene or its alkyl or aryl substituted analogues. In some embodiments of the invention, selection is preferably accomplished by coupling this pathway to one providing a nutrient to the bacteria. Thus, for example, desulfurization of dibenzothiophene results in formation of hydroxybiphenyl.

Hydroxybiphenyl is a substrate for the biphenyl-catabolizing pathway, which provides carbon and energy. Selection would thus be done by subjecting the dibenzothiophene genes to stochastic &/or non-stochastic mutagenesis and transforming them into a host containing the biphenyl-catabolizing pathway. Increased dibenzothiophene desulfurization will result in increased nutrient availability and increased growth rate. Once the genes have been evolved they are easily separated from the biphenyl degrading genes. The latter are undesirable in the final product since the object is to desulfurize without decreasing the energy content of the oil. Alkyl or aryl substituted dibenzothiophenes can be detected by changes in fluorescence (Krawiec, S., Devel. Indus. Microbiology 31: 103-114 (1990)) or by detection of phenol groups formed as a result of desulfurization (Dacre, J. C. Anal. Chem. 43: 589-591 (1971)).

4.7.8.8 Organo-Nitro Compounds

Organo-nitro compounds are used as explosives, dyes, drugs, polymers, and antimicrobial agents. Biodegradation of these compounds usually occurs by way of reduction of the nitro group, catalyzed by nitroreductases, a family of broadly-specific enzymes. Partial reduction of organo-nitro compounds often results in the formation of a compound more toxic than the original (Hassan et al. 1979 Arch Bioch Biop. 196: 385-395). Recursive sequence recombination of nitroreductases can produce enzymes that are more specific, and able to more completely reduce (and thus detoxify) their target compounds (examples of which include but are not limited to nitrotoluenes and nitrobenzenes). Nitroreductases can be isolated from bacteria found in explosive-contaminated soils, such as Morganella morganii and Enterobacter cloacae (Bryant et al., 1991. J. Biol. Chem. 266: 4126-4130). A preferred selection method is to look for increased resistance to the organo-nitro compound of interest, since that will indicate that the enzyme is also able to reduce any toxic partial reduction products of the original compound.

4.7.8.9 Alternative Substrates for Chemical Synthesis

Metabolic engineering can be used to alter microorganisms that produce industrially useful chemicals, so that they will grow using alternate and more abundant sources of nutrients, including human-produced industrial wastes. This typically involves providing both a transport system to get the alternative substrate into the engineered cells and catabolic enzymes from the natural host organisms to the engineered cells.

In some instances, enzymes can be secreted into the medium by engineered cells to degrade the alternate substrate into a form that can more readily be taken up by the engineered cells; in other instances, a batch of engineered cells can be grown on one preferred substrate, then lysed to liberate hydrolytic enzymes for the alternate substrate into the medium, while a second inoculum of the same engineered host or a second host is added to utilize the hydrolyzate.

The starting materials for recursive sequence recombination will typically be genes for utilization of a substrate or its transport. Examples of nutrient sources of interest include but are not limited to lactose, whey, galactose, mannitol, xylan, cellobiose, cellulose and sucrose, thus allowing cheaper production of compounds including but not limited to ethanol, tryptophan, rhamnolipid surfactants, xanthan gum, and polyhydroxyalkanoate. For a review of such substrates as desired target substances, see Cameron et al. (Appl. Biochem. Biotechnol. 38: 105-140 (1993)). The recursive sequence recombination methods described above can be used to optimize the ability of native hosts or heterologous hosts to utilize a substrate of interest, to evolve more efficient transport systems, to increase or alter specificity for certain substrates, and so on.

4.7.8.10 Modification of Cell Properties

Although not strictly examples of manipulation of intermediary metabolism, recursive sequence recombination techniques can be used to improve or alter other aspects of cell properties, from growth rate to ability to secrete certain desired compounds to ability to tolerate increased temperature or other environmental stresses. Some examples of traits engineered by traditional methods include expression of heterologous proteins in bacteria, yeast, and other eukaryotic cells, antibiotic resistance, and phage resistance. Any of these traits is advantageously evolved by the recursive sequence recombination techniques of the instant invention. Examples include replacement of one nutrient uptake system (e.g. ammonia in Methylophilus methylotrophus) with another that is more energy efficient; expression of haemoglobin to improve growth under conditions of limiting oxygen; redirection of toxic metabolic end products to less toxic compounds; expression of genes conferring tolerance to salt, drought and toxic compounds and resistance to pathogens, antibiotics and bacteriophage, reviewed in Cameron et al., Appl. Biochem. Biotechnol. 38: 105-140 (1993).

The heterologous genes encoding these functions all have the potential for further optimization in their new hosts by existing recursive sequence recombination technology. Since these functions increase cell growth rates under the desired growth conditions, optimization of the genes by evolution simply involves recombining the DNA recursively and selecting the recombinants that grow faster with limiting oxygen, higher toxic compound concentration, or whatever is the appropriate growth condition for the parameter being improved.

Since these functions increase cell growth rates under the desired growth conditions, optimization of the genes by “evolution” can simply involve subjecting the DNA to stochastic &/or non-stochastic mutagenesis and selecting the recombinants that grow faster with limiting oxygen, higher toxic compound concentration or whatever restrictive condition is being overcome. Cultured mammalian cells also require essential amino acids to be present in the growth medium. This requirement could also be circumvented by expression of heterologous metabolic pathways that synthesize these amino acids (Rees et al., Biotechnology 8: 629-633 (1990)). Recursive sequence recombination would provide a mechanism for optimizing the expression of these genes in mammalian cells. Once again, a preferred selection would be for cells that can grow in the absence of added amino acids.

Yet another candidate for improvement through the techniques of the invention is symbiotic nitrogen fixation. Genes involved in nodulation (nod, ndv), nitrogen reduction (nif, fix), host range determination (nod, hsp), bacteriocin production (tfx), surface polysaccharide synthesis (exo) and energy utilization (dct, hup) have been identified (Paau, Biotech. Adv. 9: 173-184 (1991)). The main function of recursive sequence recombination in this case is in improving the survival of strains that are already known to be better nitrogen fixers. These strains tend to compete poorly with strains already present in the environment, even though they are better at nitrogen fixation. Targets for recursive sequence recombination such as nodulation and host range determination genes can be modified and selected for by their ability to grow on the new host.

Similarly, any bacteriocin or energy utilization genes that will improve the competitiveness of the strain will also result in greater growth rates. Selection can simply be performed by subjecting the target genes to recursive sequence recombination and forcing the inoculant to compete with wild type nitrogen fixing bacteria. The better the nitrogen fixing bacteria grow in the new host, the more copies of their recombined genes will be present for the next round of recombination. This growth rate differentiating selection is described above in detail.
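The enrichment behind growth rate differentiating selection can be illustrated with a small calculation. The following is a minimal sketch only, assuming simple exponential growth with no nutrient limitation; the function name `enrich`, the strain labels, and the growth rates are hypothetical and are not taken from the specification.

```python
import math

def enrich(fractions, growth_rates, hours):
    """Return the population fractions after `hours` of exponential growth.

    `fractions` maps strain name -> starting fraction of the population.
    `growth_rates` maps strain name -> specific growth rate (per hour).
    Assumes unrestricted exponential growth (an idealization).
    """
    weights = {
        name: frac * math.exp(growth_rates[name] * hours)
        for name, frac in fractions.items()
    }
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}

# Hypothetical competition: a rare recombinant with a modest growth advantage.
start = {"wild_type": 0.99, "recombinant": 0.01}
rates = {"wild_type": 0.50, "recombinant": 0.60}  # assumed values, per hour
after = enrich(start, rates, hours=72)
```

Under these assumed numbers the initially rare recombinant comes to dominate the culture, which is why more copies of its recombined genes are available for the next round of recombination.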

4.7.8.11 Biodetectors/Biosensors

Bioluminescence or fluorescence genes can be used as reporters by fusing them to specific regulatory genes (Cameron et al., Appl. Biochem. Biotechnol. 38: 105-140 (1993)). A specific example is one in which the luciferase genes luxCDABE of Vibrio fischeri were fused to the regulatory region of the isopropylbenzene catabolism operon from Pseudomonas putida RE204.

Transformation of this fusion construct into E. coli resulted in a strain which produced light in response to a variety of hydrophobic compounds such as substituted benzenes, chlorinated solvents and naphthalene (Selifonova et al., Appl Environ Microbiol 62: 778-783 (1996)). This type of construct is useful for the detection of pollutant levels, and has the added benefit of only measuring those pollutants that are bioavailable (and therefore potentially toxic). Other signal molecules such as jellyfish green fluorescent protein could also be fused to genetic regulatory regions that respond to chemicals in the environment. This should allow a variety of molecules to be detected by their ability to induce expression of a protein or proteins which result in light, fluorescence or some other easily detected signal. Recursive sequence recombination can be used in several ways to modify this type of biodetection system. It can be used to increase the amplitude of the response, for example by increasing the fluorescence of the green fluorescent protein. Recursive sequence recombination could also be used to increase induced expression levels or catalytic activities of other signal-generating systems, for example of the luciferase genes. Recursive sequence recombination can also be used to alter the specificity of biosensors. The regulatory region, and transcriptional activators that interact with this region and with the chemicals that induce transcription, can also be subjected to stochastic &/or non-stochastic mutagenesis. This should generate regulatory systems in which transcription is activated by analogues of the normal inducer, so that biodetectors for different chemicals can be developed. In this case, selection would be for constructs that are activated by the (new) specific chemical to be detected. Screening could be done simply with fluorescence (or light) activated cell sorting, since the desired improvement is in light production.

In addition to detection of environmental pollutants, biosensors can be developed that will respond to any chemical for which there are receptors, or for which receptors can be evolved by recursive sequence recombination, such as hormones, growth factors, metals and drugs. These receptors may be intracellular and direct activators of transcription, or they may be membrane bound receptors that activate transcription of the signal indirectly, for example by a phosphorylation cascade. They may also not act on transcription at all, but may produce a signal by some post-transcriptional modification of a component of the signal generating pathway. These receptors may also be generated by fusing domains responsible for binding different ligands with different signaling domains. Again, recursive sequence recombination can be used to increase the amplitude of the signal generated, to optimize expression and functioning of chimeric receptors, and to alter the specificity of the chemicals detected by the receptor.

4.8 Promoting Genetic Exchange

Some methods of the invention effect recombination of cellular DNA by propagating cells under conditions inducing exchange of DNA between cells. DNA exchange can be promoted by generally applicable methods such as electroporation, biolistics, cell fusion, or in some instances, by conjugation, transduction, or Agrobacterium-mediated transfer and meiosis. For example, Agrobacterium can transform S. cerevisiae with T-DNA, which is incorporated into the yeast genome by both homologous recombination and a gap repair mechanism (Piers et al., Proc. Natl. Acad. Sci. USA 93(4), 1613-8 (1996)).

In some methods, initial diversity between cells (i.e., before genome exchange) is induced by chemical or radiation-induced mutagenesis of a progenitor cell type, optionally followed by screening for a desired phenotype. In other methods, diversity is natural, as where cells are obtained from different individuals, strains or species.

In some stochastic &/or non-stochastic mutagenesis methods, induced exchange of DNA is used as the sole means of effecting recombination in each cycle of recombination. In other methods, induced exchange is used in combination with natural sexual recombination of an organism. In other methods, induced exchange and/or natural sexual recombination are used in combination with the introduction of a fragment library. Such a fragment library can be a whole genome, a whole chromosome, a group of functionally or genetically linked genes, a plasmid, a cosmid, a mitochondrial genome, a viral genome (replicative and nonreplicative) or specific or random fragments of any of these. The DNA can be linked to a vector or can be in free form. Some vectors contain sequences promoting homologous or nonhomologous recombination with the host genome. Some fragments contain double stranded breaks, such as those caused by shearing with glass beads, sonication, or chemical or enzymatic fragmentation, to stimulate recombination. In each case, DNA can be exchanged between cells, after which it can undergo recombination to form hybrid genomes. Generally, cells are recursively subjected to recombination to increase the diversity of the population prior to screening. Cells bearing hybrid genomes, e.g., generated after at least one, and usually several, cycles of recombination are screened for a desired phenotype, and cells having this phenotype are isolated. These cells can additionally form starting materials for additional cycles of recombination in a recursive recombination/selection scheme.

4.8.1 Protoplast Fusion

One means of promoting exchange of DNA between cells is by fusion of cells, such as by protoplast fusion. A protoplast results from the removal from a cell of its cell wall, leaving a membrane-bound cell that depends on an isotonic or hypertonic medium for maintaining its integrity. If the cell wall is partially removed, the resulting cell is strictly referred to as a spheroplast and if it is completely removed, as a protoplast. However, here the term protoplast includes spheroplasts unless otherwise indicated.

Protoplast fusion is described by Schaffner et al., Proc. Natl. Acad. Sci. USA 77, 2163 (1980) and other exemplary procedures are described by Yoakum et al., U.S. Pat. No. 4,608,339, Takahashi et al., U.S. Pat. No. 4,677,066 and Sambrook et al., at Ch. 16. Protoplast fusion has been reported between strains, species, and genera (e.g., yeast and chicken erythrocyte). Protoplasts can be prepared for both bacterial and eukaryotic cells, including mammalian cells and plant cells, by several means including chemical treatment to strip cell walls. For example, cell walls can be stripped by digestion with a cell wall degrading enzyme such as lysozyme in a 10-20% sucrose, 50 mM EDTA buffer. Conversion of cells to spherical protoplasts can be monitored by phase-contrast microscopy. Protoplasts can also be prepared by propagation of cells in media supplemented with an inhibitor of cell wall synthesis, or use of mutant strains lacking capacity for cell wall formation. Preferably, eukaryotic cells are synchronized in G1 phase by arrest with inhibitors such as α-factor, K. lactis killer toxin, leflunomide and adenylate cyclase inhibitors. Optionally, some but not all protoplasts to be fused can be killed and/or have their DNA fragmented by treatment with ultraviolet irradiation, hydroxylamine or cupferon (Reeves et al., FEMS Microbiol. Lett. 99, 193-198 (1992)). In this situation, killed protoplasts are referred to as donors, and viable protoplasts as acceptors.

Using dead donor cells can be advantageous in subsequently recognizing fused cells with hybrid genomes, as described below. Further, breaking up DNA in donor cells is advantageous for stimulating recombination with acceptor DNA. Optionally, acceptor and/or fused cells can also be briefly, but nonlethally, exposed to UV irradiation to further stimulate recombination.

Once formed, protoplasts can be stabilized in a variety of osmolytes and compounds such as sodium chloride, potassium chloride, sodium phosphate, potassium phosphate, sucrose, and sorbitol in the presence of DTT. The combination of buffer, pH, reducing agent, and osmotic stabilizer can be optimized for different cell types. Protoplasts can be induced to fuse by treatment with a chemical such as PEG, calcium chloride or calcium propionate, or by electrofusion (Tsoneva, Acta Microbiologica Bulgaria 24, 53-59 (1989)). A method of cell fusion employing electric fields has also been described. See Chang, U.S. Pat. No. 4,970,154. Conditions can be optimized for different strains.

The fused cells are heterokaryons containing genomes from two or more component protoplasts. Fused cells can be enriched from unfused parental cells by sucrose gradient sedimentation or cell sorting. The two nuclei in the heterokaryons can fuse (karyogamy) and homologous recombination can occur between the genomes. The chromosomes can also segregate asymmetrically, resulting in regenerated protoplasts that have lost or gained whole chromosomes. The frequency of recombination can be increased by treatment with ultraviolet irradiation or by use of strains overexpressing recA or other recombination genes, or the yeast rad genes, and cognate variants thereof in other species, or by the inhibition of gene products of MutS, MutL, or MutD. Overexpression can be either the result of introduction of exogenous recombination genes or the result of selecting strains which, as a result of natural variation or induced mutation, overexpress endogenous recombination genes. The fused protoplasts are propagated under conditions allowing regeneration of cell walls, recombination and segregation of recombinant genomes into progeny cells from the heterokaryon, and expression of recombinant genes. This process can be reiteratively repeated to increase the diversity of any set of protoplasts or cells. After, or occasionally before or during, recovery of fused cells, the cells are screened or selected for evolution toward a desired property.

Thereafter a subsequent round of recombination can be performed by preparing protoplasts from the cells surviving selection/screening in a previous round. The protoplasts are fused, recombination occurs in fused protoplasts, and cells are regenerated from the fused protoplasts. This process can again be reiteratively repeated to increase the diversity of the starting population. Protoplasts, regenerated or regenerating cells are subject to further selection or screening.
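The recursive structure described above (several fusion cycles without selection, followed by selection, with survivors seeding the next round) can be sketched as a toy simulation. This is an illustrative model only: genomes are assumed to be tuples of 0/1 alleles, fusion is modeled as an independent choice of either parent's allele at each locus, and fitness is assumed to be the number of 1 alleles. The names `fuse`, `fitness`, and `evolve` and all parameter values are hypothetical.

```python
import random

def fuse(genome_a, genome_b, rng):
    """Toy protoplast fusion: each locus is inherited from either parent."""
    return tuple(rng.choice(pair) for pair in zip(genome_a, genome_b))

def fitness(genome):
    """Assumed screen: count of beneficial (1) alleles."""
    return sum(genome)

def evolve(population, rounds, rng, fusion_cycles=2, pool_size=20, keep=5):
    """Run recursive rounds of fusion followed by selection/screening."""
    for _ in range(rounds):
        pool = list(population)
        # Repeat fusion several times without selection to build diversity.
        for _ in range(fusion_cycles):
            next_pool = []
            for _ in range(pool_size):
                parent_a, parent_b = rng.sample(pool, 2)
                next_pool.append(fuse(parent_a, parent_b, rng))
            pool = next_pool
        # Selection/screening: survivors seed the next round of protoplasting.
        population = sorted(pool, key=fitness, reverse=True)[:keep]
    return population

parents = [(0, 1, 1, 0, 0, 0), (0, 0, 0, 1, 1, 0), (0, 0, 0, 0, 0, 1)]
survivors = evolve(parents, rounds=3, rng=random.Random(0))
```

Because this toy model includes no mutation, an allele absent from every parent (locus 0 above) cannot appear in the survivors; recombination alone only reassorts existing diversity.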

Subsequent rounds of recombination can be performed on a split pool basis as described above. That is, a first subpopulation of cells surviving selection/screening from a previous round is used for protoplast formation. A second subpopulation of cells surviving selection/screening from a previous round is used as a source for DNA library preparation.

The DNA library from the second subpopulation of cells is then transformed into the protoplasts from the first subpopulation. The library undergoes recombination with the genomes of the protoplasts to form recombinant genomes. This process can be repeated several times in the absence of a selection event to increase the diversity of the cell population. Cells are regenerated from protoplasts, and selection/screening is applied to regenerating or regenerated cells. In a further variation, a fresh library of nucleic acid fragments is introduced into protoplasts surviving selection/screening from a previous round.

The overall scheme comprises protoplast formation of donor and recipient strains, heterokaryon formation, karyogamy, recombination, and segregation of recombinant genomes into separate cells. Optionally, the recombinant genomes, if the cells have a sexual cycle, can undergo further recombination with each other as a result of meiosis and mating. Recursive cycles of protoplast fusion, or recursive mating/meiosis, are often used to increase the diversity of a cell population. After achieving a sufficiently diverse population via one of these forms of recombination, cells are screened or selected for a desired property. Cells surviving selection/screening can then be used as the starting materials in a further cycle of protoplasting or other recombination methods as noted herein.

4.8.2 Parasexual Reproduction

Parasexual reproduction provides a further means for exchanging genetic material between cells for stochastic &/or non-stochastic mutagenesis. This process allows recombination of parental DNA without involvement of mating types or gametes. Parasexual fusion occurs by hyphal fusion, giving rise to a common cytoplasm containing different nuclei. The two nuclei can divide independently in the resulting heterokaryon but occasionally fuse. Fusion is followed by haploidization, which can involve loss of chromosomes and mitotic crossing over between homologous chromosomes. Protoplast fusion is a form of parasexual reproduction.

4.8.3 Selection for Hybrid Strains

4.8.3.1 Identifying Cells Formed by the Fusion of Components of Parental Cells from Two or More Distinct Subpopulations

The invention provides selection strategies to identify cells formed by fusion of components from parental cells from two or more distinct subpopulations. Selection for hybrid cells is usually performed before selecting or screening for cells that have evolved (as a result of genetic exchange) to acquisition of a desired property. A basic premise of most such selection schemes is that two initial subpopulations have two distinct markers. Cells with hybrid genomes can thus be identified by selection for both markers.
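The two-marker premise can be expressed as a simple filter. The sketch below is illustrative only; the `Cell` record, the marker names (`biotin`, `gfp`), and the function `select_hybrids` are assumptions introduced for the example, not part of the described method.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Cell:
    name: str
    markers: frozenset  # markers detectable on or in this cell

def select_hybrids(cells, marker_a, marker_b):
    """Keep only cells carrying both parental markers (candidate fusants)."""
    return [
        cell for cell in cells
        if marker_a in cell.markers and marker_b in cell.markers
    ]

# Hypothetical recovered population after fusion.
population = [
    Cell("unfused_parent_1", frozenset({"biotin"})),
    Cell("unfused_parent_2", frozenset({"gfp"})),
    Cell("fusant", frozenset({"biotin", "gfp"})),
]
hybrids = select_hybrids(population, "biotin", "gfp")
```

In practice the two markers are often screened in separate steps (for example affinity capture followed by cell sorting); the filter above simply states the combined outcome.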

4.8.3.2 Method Where One Subpopulation has a Marker

In one such scheme, at least one subpopulation of cells bears a selective marker attached to its cell membrane. Examples of suitable membrane markers include biotin, fluorescein and rhodamine. The markers can be linked to amide or thiol groups or through more specific derivatization chemistries, such as iodoacetates, iodoacetamides, and maleimides.

For example, a marker can be attached as follows. Cells or protoplasts are washed with a buffer (e.g., PBS) that does not interfere with the chemical coupling of a chemically active ligand which reacts with amino groups of lysines or N-terminal amino groups of membrane proteins. The ligand is either amine reactive itself (e.g., isothiocyanates, succinimidyl esters, sulfonyl chlorides) or is activated by a heterobifunctional linker (e.g. EMCS, SIAB, SPDP, SMB) to become amine reactive. The ligand is a molecule which is easily bound by protein derivatized magnetic beads or other capturing solid supports. For example, the ligand can be succinimidyl activated biotin (Molecular Probes Inc.: B-1606, B-2603, S-1515, S-1582). This linker is reacted with amino groups of proteins residing in and on the surface of a cell. The cells are then washed to remove excess labeling agent before contacting with cells from the second subpopulation bearing a second selective marker.

The second subpopulation of cells can also bear a membrane marker, albeit a different membrane marker from the first subpopulation. Alternatively, the second subpopulation can bear a genetic marker. The genetic marker can confer a selective property such as drug resistance or a screenable property, such as expression of green fluorescent protein.

After fusion of first and second subpopulations of cells and recovery, cells are screened or selected for the presence of markers on both parental subpopulations. For example, fusants are enriched for one population by adsorption to specific beads and these are then sorted by FACS for those expressing a marker. Cells surviving both screens for both markers are those having undergone protoplast fusion, and are therefore more likely to have recombined genomes. Usually, the markers are screened or selected separately. Membrane-bound markers, such as biotin, can be screened by affinity enrichment for the cell membrane marker (e.g., by panning fused cells on an affinity matrix). For example, for a biotin membrane label, cells can be affinity purified using streptavidin-coated magnetic beads (Dynal). These beads are washed several times to remove the non-fused host cells.

Alternatively, cells can be panned against an antibody to the membrane marker. In a further variation, if the membrane marker is fluorescent, cells bearing the marker can be identified by FACS. Screens for genetic markers depend on the nature of the markers, and include capacity to grow on drug-treated media or FACS selection for green fluorescent protein. If first and second cell populations have fluorescent markers of different wavelengths, both markers can be screened simultaneously by FACS sorting.

In a further selection scheme for hybrid cells, first and second populations of cells to be fused express different subunits of a heteromultimeric enzyme. Usually, the heteromultimeric enzyme has two different subunits, but heteromultimeric enzymes having three, four or more different subunits can be used. If an enzyme has more than two different subunits, each subunit can be expressed in a different subpopulation of cells (e.g., three subunits in three subpopulations), or more than one subunit can be expressed in the same subpopulation of cells (e.g., one subunit in one subpopulation, two subunits in a second subpopulation). In the case where more than two subunits are used, selection for the poolwise recombination of more than two protoplasts can be achieved.

Hybrid cells representing a combination of genomes of first, second or more subpopulation component cells can then be recognized by an assay for intact enzyme. Such an assay can be a binding assay, but is more typically a functional assay (e.g., capacity to metabolize a substrate of the enzyme). Enzymatic activity can be detected, for example, by processing of a substrate to a product with a fluorescent or otherwise easily detectable absorbance or emission spectrum. The individual subunits of a heteromultimeric enzyme used in such an assay preferably have no enzymic activity in dissociated form, or at least have significantly less activity in dissociated form than associated form. Preferably, the cells used for fusion lack an endogenous form of the heteromultimeric enzyme, or at least have significantly less endogenous activity than results from heteromultimeric enzyme formed by fusion of cells.

Penicillin acylase enzymes, cephalosporin acylase and penicillin acyltransferase are examples of suitable heteromultimeric enzymes. These enzymes are encoded by a single gene, which is translated as a proenzyme and cleaved by posttranslational autocatalytic proteolysis to remove a spacer endopeptide and generate two subunits, which associate to form the active heterodimeric enzyme. Neither subunit is active in the absence of the other subunit. However, activity can be reconstituted if these separated gene portions are expressed in the same cell by co-transformation. Other enzymes that can be used have subunits that are encoded by distinct genes (e.g., faoA and faoB genes encode 3-oxoacyl-CoA thiolase of Pseudomonas fragi (Biochem. J. 328, 815-820 (1997))).

An exemplary enzyme is penicillin G acylase from Escherichia coli, which has two subunits encoded by a single gene. Fragments of the gene encoding the two subunits, operably linked to appropriate expression regulation sequences, are transfected into first and second subpopulations of cells, which lack endogenous penicillin acylase activity. A cell formed by fusion of component cells from the first and second subpopulations expresses the two subunits, which assemble to form functional enzyme, e.g., penicillin acylase. Fused cells can then be selected on agar plates containing penicillin G, which is degraded by penicillin acylase.

In another variation, fused cells are identified by complementation of auxotrophic mutants. Parental subpopulations of cells can be selected for known auxotrophic mutations. Alternatively, auxotrophic mutations in a starting population of cells can be generated spontaneously or by exposure to a mutagenic agent. Cells with auxotrophic mutations are selected by replica plating on minimal and complete media. Lesions resulting in auxotrophy are expected to be scattered throughout the genome, in genes for amino acid, nucleotide, and vitamin biosynthetic pathways. After fusion of parental cells, cells resulting from fusion can be identified by their capacity to grow on minimal media. These cells can then be screened or selected for evolution toward a desired property. Further steps of mutagenesis generating fresh auxotrophic mutations can be incorporated in subsequent cycles of recombination and screening/selection.
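The logic of identifying fusants by complementation can be sketched as a set operation. This is a simplified illustration which assumes that each auxotrophic lesion is recessive and is complemented whenever the other parent carries a functional copy of the same gene; the function name `grows_on_minimal` and the gene names are hypothetical.

```python
def grows_on_minimal(*parental_defects):
    """Return True if a cell combining these genomes is prototrophic.

    Each argument is the set of biosynthetic genes that are defective in one
    contributing genome. A single argument models an unfused parent. The cell
    grows on minimal media only if no gene is defective in every genome.
    """
    defect_sets = [set(defects) for defects in parental_defects]
    shared_defects = set.intersection(*defect_sets)
    return not shared_defects

# Hypothetical auxotrophs: one cannot make histidine, the other leucine.
parent_1 = {"hisA"}
parent_2 = {"leuB"}

unfused_1 = grows_on_minimal(parent_1)           # single genome
unfused_2 = grows_on_minimal(parent_2)
fusant = grows_on_minimal(parent_1, parent_2)    # complementing genomes
```

Parents sharing the same lesion would not complement each other, which is one reason the lesions are preferably scattered across different biosynthetic genes.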

In variations of the above method, de novo generation of auxotrophic mutations in each round of stochastic &/or non-stochastic mutagenesis can be avoided by reusing the same auxotrophs. For example, auxotrophs can be generated by transposon mutagenesis using a transposon bearing a selective marker. Auxotrophs are identified by a screen such as replica plating. Auxotrophs are pooled, and a generalized transducing phage lysate is prepared by growth of phage on a population of auxotrophic cells. A separate population of auxotrophic cells is subjected to genetic exchange, and complementation is used to select cells that have undergone genetic exchange and recombination. These cells are then screened or selected for acquisition of a desired property. Cells surviving screening or selection then have auxotrophic markers regenerated by introduction of the transducing transposon library. The newly generated auxotrophic cells can then be subjected to further genetic exchange and screening/selection.

In a further variation, auxotrophic mutations are generated by homologous recombination with a targeting vector comprising a selective marker flanked by regions of homology with a biosynthetic region of the genome of cells to be evolved. Recombination between the vector and the genome inserts the positive selection marker into the genome, causing an auxotrophic mutation. The vector is in linear form before introduction into cells.

Optionally, the frequency of introduction of the vector can be increased by capping its ends with self-complementary oligonucleotides annealed in a hairpin formation. Genetic exchange and screening/selection proceed as described above. In each round, targeting vectors are reintroduced, regenerating the same population of auxotrophic markers.

In another variation, fused cells are identified by screening for a genomic marker present on one subpopulation of parental cells and an episomal marker present on a second subpopulation of cells. For example, a first subpopulation of yeast containing mitochondria can be used to complement a second subpopulation of yeast having a petite phenotype (i.e., lacking mitochondria).

In a further variation, genetic exchange is performed between two subpopulations of cells, one of which is dead. Cells are preferably killed by brief exposure to DNA fragmenting agents such as hydroxylamine, cupferon, or irradiation. Viable cells are then screened for a marker present on the dead parental subpopulation.

4.8.4 Liposome Mediated Transfers

4.8.4.1 Nucleic Acid Fragment Libraries are Introduced into Protoplasts

4.8.4.1.1 The Nucleic Acids are Encapsulated in Liposomes to Help Uptake by Protoplasts

In the methods noted above, in which nucleic acid fragment libraries are introduced into protoplasts, the nucleic acids are sometimes encapsulated in liposomes to facilitate uptake by protoplasts. Liposome-mediated uptake of DNA by protoplasts is described in Radford et al., Mol. Gen. Genet. 184, 567-569 (1981). Liposomes can efficiently deliver large volumes of DNA to protoplasts (see Deshayes et al., EMBO J. 4, 2731-2737 (1985)).

See also, Philippot and Schuber (eds) (1995) Liposomes as Tools in Basic Research and Industry, CRC Press, Boca Raton, e.g., Chapter 9, Remy et al. “Gene Transfer with Cationic Amphiphiles.” Further, the DNA can be delivered as linear fragments, which are often more recombinogenic than whole genomes. In some methods, fragments are mutated prior to encapsulation in liposomes. In some methods, fragments are combined with RecA and homologs, or nucleases (e.g., restriction endonucleases) before encapsulation in liposomes to promote recombination. Alternatively, protoplasts can be treated with lethal doses of nicking reagents and then fused. Cells which survive are those which are repaired by recombination with other genomic fragments, thereby providing a selection mechanism to select for recombinant (and therefore desirably diverse) protoplasts.

4.9 Shuffling Using Filamentous Fungi

Filamentous fungi are particularly suited to performing the stochastic &/or non-stochastic mutagenesis methods described above. Filamentous fungi are divided into four main classifications based on their structures for sexual reproduction: Phycomycetes, Ascomycetes, Basidiomycetes and the Fungi Imperfecti. Phycomycetes (e.g., Rhizopus, Mucor) form sexual spores in a sporangium.

The spores can be uni- or multinucleate, and the hyphae often lack septa (coenocytic). Ascomycetes (e.g., Aspergillus, Neurospora, Penicillium) produce sexual spores in an ascus as a result of meiotic division. Asci typically contain 4 meiotic products, but some contain 8 as a result of additional mitotic division. Basidiomycetes include mushrooms and smuts, and form sexual spores on the surface of a basidium. In holobasidiomycetes, such as mushrooms, the basidium is undivided. In hemibasidiomycetes, such as rusts (Uredinales) and smut fungi (Ustilaginales), the basidium is divided. Fungi imperfecti, which include most human pathogens, have no known sexual stage.

Fungi can reproduce by asexual, sexual or parasexual means. Asexual reproduction involves vegetative growth of mycelia, nuclear division and cell division without involvement of gametes and without nuclear fusion. Cell division can occur by sporulation, budding or fragmentation of hyphae.

4.9.1 Evolve Fungi by Stochastic &/or Non-Stochastic Mutagenesis to Become Useful Hosts for Genetic Engineering of Unrelated Genes

4.9.2 To Improve the Capacity of Fungi to Make Specific Compounds

One general goal of stochastic &/or non-stochastic mutagenesis is to evolve fungi to become useful hosts for genetic engineering, in particular for the stochastic &/or non-stochastic mutagenesis of unrelated genes. A. nidulans and Neurospora are generally the fungal organisms of choice to serve as hosts for such manipulations because of their sexual cycles and well-established use in classical and molecular genetics. Another general goal is to improve the capacity of fungi to make specific compounds (e.g., antibacterials (penicillins, cephalosporins), antifungals (e.g., echinocandins, aureobasidins), and wood-degrading enzymes). There is some overlap between these general goals, and thus some desired properties are useful for achieving both goals.

4.9.3 Mutator Strain

Another desired property is the production of a mutator strain of fungi. Such a fungus can be produced by stochastic &/or non-stochastic mutagenesis of a fungal strain containing a marker gene with one or more mutations that impair or prevent expression of a functional product. Shufflants are propagated under conditions that select for expression of the positive marker (while allowing a small amount of residual growth without expression). The fastest-growing shufflants are selected to form the starting materials for the next round of stochastic &/or non-stochastic mutagenesis.

4.9.4 Expanded Host Range so as to be Able to Form Heterokaryons with More Strains

Another desired property is to expand the host range of a fungus so it can form heterokaryons with fungi from other vegetative compatibility groups. Incompatibility between species results from the interactions of specific alleles at different incompatibility loci (such as the “het” loci). If two strains undergo hyphal anastomosis, a lethal cytoplasmic incompatibility reaction may occur if the strains differ at these loci. Strains must carry identical alleles at these loci to be entirely compatible. Several of these loci have been identified in various species, and the incompatibility effect is somewhat additive (hence, “partial incompatibility” can occur). Some tolerant and het-negative mutants have been described for these organisms (e.g., Dales & Croft, J. Gen. Microbiol. 136, 1717-1724 (1990)). Further, a tolerance gene (tol) has been reported, which suppresses mating-type heterokaryon incompatibility. Stochastic &/or non-stochastic mutagenesis is performed between protoplasts of strains from different incompatibility groups. A preferred format uses a live acceptor strain and a UV-irradiated dead donor strain. The UV irradiation serves to introduce mutations into DNA, inactivating het genes. The two strains should bear different genetic markers. Protoplasts of the two strains are fused, and cells are regenerated and screened for complementation of markers. Subsequent rounds of stochastic &/or non-stochastic mutagenesis and selection can be performed in the same manner by fusing the cells surviving screening with protoplasts of a fresh population of donor cells. Similar to other procedures noted herein, the cells resulting from regeneration of the protoplasts are optionally re-fused by protoplasting and regenerated into cells one or more times prior to any selection step to increase the diversity of the resulting population of cells to be screened.
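
The compatibility rule stated above (strains are entirely compatible only if they carry identical alleles at the incompatibility loci, with a roughly additive effect of differences) can be illustrated with a minimal sketch. The locus names, allele labels and the use of a simple mismatch count as a proxy for "partial incompatibility" are illustrative assumptions, not part of the described method.

```python
from typing import Dict

def het_mismatches(strain_a: Dict[str, str], strain_b: Dict[str, str]) -> int:
    """Count shared het loci at which the two strains carry different alleles."""
    shared_loci = strain_a.keys() & strain_b.keys()
    return sum(1 for locus in shared_loci if strain_a[locus] != strain_b[locus])

def is_fully_compatible(strain_a: Dict[str, str], strain_b: Dict[str, str]) -> bool:
    """Entirely compatible only when no het locus differs."""
    return het_mismatches(strain_a, strain_b) == 0

# Hypothetical genotypes: locus names and allele labels are made up for illustration.
acceptor = {"hetA": "1", "hetB": "1", "hetC": "2"}
donor = {"hetA": "1", "hetB": "2", "hetC": "1"}
sibling = {"hetA": "1", "hetB": "1", "hetC": "2"}

print(het_mismatches(acceptor, donor))         # number of differing loci
print(is_fully_compatible(acceptor, sibling))  # identical alleles at every locus
```

A higher mismatch count is only a rough stand-in for a stronger incompatibility reaction; the actual strength depends on which loci differ.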

4.9.5 Ability to Outbreed without Self-Breeding

Another desired property is the introduction of multiple-allelomorph heterothallism into Ascomycetes and Fungi imperfecti, which do not normally exhibit this property. This mating system allows outbreeding without self-breeding. Such a mating system can be introduced by stochastic &/or non-stochastic mutagenesis of Ascomycetes and Fungi imperfecti with DNA from Gasteromycetes or Hymenomycetes, which have such a system.

4.9.6 Spontaneous Formation of Protoplasts

Another desired property is spontaneous formation of protoplasts to facilitate use of a fungal strain as a stochastic &/or non-stochastic mutagenesis host. Here, the fungus to be evolved is typically mutagenized. Spores of the fungus to be evolved are briefly treated with a cell-wall degrading agent for a time insufficient for complete protoplast formation, and are mixed with protoplasts from other strain(s) of fungi. Protoplasts formed by fusion of the two different subpopulations are identified by genetic or other selection and/or screening as described above. These protoplasts are used to regenerate mycelia and then spores, which form the starting material for the next round of stochastic &/or non-stochastic mutagenesis. In the next round, at least some of the surviving spores are treated with cell-wall removing enzyme, but for a shorter time than in the previous round. After treatment, the partially stripped cells are labeled with a first label. These cells are then mixed with protoplasts, which may derive from other cells surviving selection in a previous round, or from a fresh strain of fungi. These protoplasts are physically labeled with a second label. After incubating the cells under conditions for protoplast fusion, fusants with both labels are selected.

These fusants are used to generate mycelia and spores for the next round of stochastic &/or non-stochastic mutagenesis, and so forth. Eventually, progeny that spontaneously form protoplasts (i.e., without addition of cell wall degrading agent) are identified. As with other procedures noted herein, cells or protoplasts can be reiteratively fused and regenerated prior to performing any selection step to increase the diversity of the resulting cells or protoplasts to be screened. Similarly, selected cells or protoplasts can be reiteratively fused and regenerated for one or several cycles without imposing selection on the resulting cellular or protoplast populations, thereby increasing the diversity of cells or protoplasts which are eventually screened. This process of performing multiple cycles of recombination interspersed with selection steps can be reiteratively repeated as desired.

4.9.7 Acquisition or Improvement of Genes Encoding Enzymes in Biosynthetic Pathways; Transporter Proteins; and Metabolic Flux

Another desired property is the acquisition and/or improvement of genes encoding enzymes in biosynthetic pathways, genes encoding transporter proteins, and genes encoding proteins involved in metabolic flux control. In this situation, genes of the pathway can be introduced into the fungus to be evolved either by genetic exchange with another strain of fungus possessing the pathway or by introduction of a fragment library from an organism possessing the pathway. Genetic material of these fungi can then be subjected to further stochastic &/or non-stochastic mutagenesis and screening/selection by the various procedures discussed in this application.

Shufflant strains of fungi are selected/screened for production of the compound produced by the metabolic pathway or precursors thereof.

4.9.8 Increased Stability to Extreme Conditions

Another desired property is increasing the stability of fungi to extreme conditions such as heat. In this situation, genes conferring stability can be acquired by exchanging DNA with or transforming DNA from a strain that already has such properties. Alternatively, the strain to be evolved can be subjected to random mutagenesis. Genetic material of the fungus to be evolved can be subjected to stochastic &/or non-stochastic mutagenesis by any of the procedures described in this application, with shufflants being selected by surviving exposure to extreme conditions.

4.9.9 Growth Under Altered Nutritional Requirements

Another desired property is the capacity of a fungus to grow under altered nutritional requirements (e.g., growth on particular carbon or nitrogen sources). Altering nutritional requirements is particularly valuable, e.g., for natural isolates of fungi that produce valuable commercial products but have esoteric and therefore expensive nutritional requirements. The strain to be evolved undergoes genetic exchange and/or transformation with DNA from a strain that has the desired nutritional requirements. The fungus to be evolved can then optionally be subjected to further stochastic &/or non-stochastic mutagenesis as described in this application, with recombinant strains being selected for capacity to grow in the desired nutritional circumstances. Optionally, the nutritional circumstances can be varied in successive rounds of stochastic &/or non-stochastic mutagenesis, starting close to the natural requirements of the fungus to be evolved and in subsequent rounds approaching the desired nutritional requirements.
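
The stepwise variation of nutritional circumstances over successive rounds can be pictured as a selection schedule. The sketch below is a minimal illustration that assumes the condition can be expressed as a single number (here, a hypothetical fraction of the carbon supplied by the desired source) and that the steps are evenly spaced; the text does not prescribe either choice.

```python
from typing import List

def selection_schedule(start: float, target: float, rounds: int) -> List[float]:
    """Evenly spaced condition values, one per round, from start to target."""
    if rounds < 1:
        raise ValueError("rounds must be at least 1")
    if rounds == 1:
        return [target]
    span = target - start
    return [start + span * i / (rounds - 1) for i in range(rounds)]

# Hypothetical example: begin near the natural requirement (10% of carbon from
# the desired source) and reach the desired condition (100%) in four rounds.
for round_number, fraction in enumerate(selection_schedule(0.1, 1.0, 4), start=1):
    print(f"round {round_number}: {fraction:.2f} of carbon from desired source")
```

In practice the step size would likely be adjusted to how many shufflants survive each round rather than fixed in advance.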

4.9.10 Natural Competence to Take Up a Plasmid Bearing a Selective Marker

Another desired property is acquisition of natural competence in a fungus. The procedure for acquisition of natural competence by stochastic &/or non-stochastic mutagenesis is generally described in PCT/US97/04494. The fungus to be evolved typically undergoes genetic exchange or transformation with DNA from a bacterial strain or fungal strain that already has this property.

Cells with recombinant genomes are then selected by capacity to take up a plasmid bearing a selective marker. Further rounds of recombination and selection can be performed using any of the procedures described above.

4.9.11 Reduced or Increased Secretion of Proteases and DNases

Another desired property is reduced or increased secretion of proteases and DNases. In this situation, the fungus to be evolved can acquire DNA by exchange or transformation from another strain known to have the desired property. Alternatively, the fungus to be evolved can be subjected to random mutagenesis. The fungus to be evolved is subjected to stochastic &/or non-stochastic mutagenesis as above. The presence of such enzymes, or lack thereof, can be assayed by contacting the culture media from individual isolates with a fluorescent molecule tethered to a support via a peptide or DNA linkage. Cleavage of the linkage releases detectable fluorescence to the media.

4.9.12 Altered Transporters to Use Secondary Components

Another desired property is producing fungi with altered transporters (e.g., MDR). Such altered transporters are useful, for example, in fungi that have been evolved to produce new secondary metabolites, to allow entry of precursors required for synthesis of the new secondary metabolites into a cell, or to allow efflux of the secondary metabolite from the cell. Transporters can be evolved by introduction of a library of transporter variants into fungal cells and allowing the cells to recombine by sexual or parasexual recombination. To evolve a transporter with capacity to transport a precursor into the cells, cells are propagated in the presence of precursor, and cells are then screened for production of metabolite. To evolve a transporter with capacity to export a metabolite, cells are propagated under conditions supporting production of the metabolite, and screened for export of metabolite to culture medium.

A general method of fungal stochastic &/or non-stochastic mutagenesis is shown herein. Spores from a frozen stock, a lyophilized stock, or fresh from an agar plate are used to inoculate suitable liquid medium (1). Spores are germinated, resulting in hyphal growth (2). Mycelia are harvested and washed by filtration and/or centrifugation. Optionally, the sample is pretreated with DTT to enhance protoplast formation (3). Protoplasting is performed in an osmotically stabilizing medium (e.g., 1 M NaCl/20 mM MgSO4, pH 5.8) by the addition of cell wall-degrading enzyme (e.g., Novozyme 234) (4). Cell wall-degrading enzyme is removed by repeated washing with osmotically stabilizing solution (5). Protoplasts can be separated from mycelia, debris and spores by filtration through miracloth and density centrifugation (6). Protoplasts are harvested by centrifugation and resuspended to the appropriate concentration. This step may lead to some protoplast fusion (7). Fusion can be stimulated by addition of PEG (e.g., PEG 3350), and/or repeated centrifugation and resuspension with or without PEG. Electrofusion can also be performed (8). Fused protoplasts can optionally be enriched from unfused protoplasts by sucrose gradient sedimentation (or other methods of screening described above). Fused protoplasts can optionally be treated with ultraviolet irradiation to stimulate recombination (9). Protoplasts are cultured on osmotically stabilized agar plates to regenerate cell walls and form mycelia (10). The mycelia are used to generate spores (11), which are used as the starting material in the next round of stochastic &/or non-stochastic mutagenesis (12). Selection for a desired property can be performed either on regenerated mycelia or on spores derived therefrom.

In an alternative method, protoplasts are formed by inhibition of one or more enzymes required for cell wall synthesis. The inhibitor should be fungistatic rather than fungicidal under the conditions of use. Examples of inhibitors include antifungal compounds described by, e.g., Georgopapadakou & Walsh, Antimicrob. Ag. Chemother. 40, 279-291 (1996); Lyman & Walsh, Drugs 44, 9-35 (1992). Other examples include chitin synthase inhibitors (polyoxin or nikkomycin compounds) and/or glucan synthase inhibitors (e.g., echinocandins, papulocandins, pneumocandins). Inhibitors should be applied in osmotically stabilized medium. Cells stripped of their cell walls can be fused or otherwise employed as donors or hosts in genetic transformation/strain development programs.

In a further variation, protoplasts are prepared using strains of fungi which are genetically deficient or compromised in their ability to synthesize intact cell walls. Such mutants are generally referred to as fragile, osmotic-remedial, or cell wall-less, and are obtainable from strain depositories. Examples of such strains include Neurospora crassa os mutants (Selitrennikoff, Antimicrob. Agents Chemother. 23, 757-765 (1983)). Some such mutations are temperature-sensitive. Temperature-sensitive strains can be propagated at the permissive temperature for purposes of selection and amplification and at a nonpermissive temperature for purposes of protoplast formation and fusion. A temperature-sensitive Neurospora crassa os strain has been described which propagates as protoplasts when grown in osmotically stabilizing medium containing sorbose and polyoxin at nonpermissive temperature but generates whole cells on transfer to medium containing sorbitol at a permissive temperature. See U.S. Pat. No. 4,873,196.

Other suitable strains can be produced by targeted mutagenesis of genes involved in chitin synthesis, glucan synthesis and other cell wall-related processes. Examples of such genes include CHT1, CHT2 and CAL1 (or CSD2) of Saccharomyces cerevisiae and Candida spp. (Georgopapadakou & Walsh 1996); ETG1/FKS1/CND1/CWH53/PBR1 and homologs in S. cerevisiae, Candida albicans, Cryptococcus neoformans, Aspergillus fumigatus; and ChvA/NdvA of Agrobacterium and Rhizobium. Other examples are orlA, orlB, orlC, orlD, tsE, and bimG of Aspergillus nidulans (Borgia, J. Bacteriol. 174, 377-389 (1992)).

Strains of A. nidulans containing OrlA1 or tse1 mutations lyse at restrictive temperatures. Lysis of these strains may be prevented by osmotic stabilization, and the mutations may be complemented by the addition of N-acetylglucosamine (GlcNAc). BimG11 mutations are ts for a type I protein phosphatase (germlings of strains carrying this mutation lack chitin, and conidia swell and lyse). Other suitable genes are chsA, chsB, chsC, chsD and chsE of Aspergillus fumigatus; chs1 and chs2 of Neurospora crassa; Phycomyces blakesleeanus MM and chs1, chs2 and chs3 of S. cerevisiae. Chs1 is a non-essential repair enzyme; chs2 is involved in septum formation and chs3 is involved in cell wall maturation and bud ring formation.

Other useful strains include S. cerevisiae CLY (cell lysis) mutants such as ts strains (Paravicini et al., Mol. Cell Biol. 12, 4896-4905 (1992)), and the CLY 15 strain which harbors a PKC 1 gene deletion. Other useful strains include strain VY 1160 containing a ts mutation in srb (encoding actin) (Schade et al., Acta Histochem. Suppl. 41, 193-200 (1991)), and a strain with an ses mutation which results in increased sensitivity to cell-wall digesting enzymes isolated from snail gut (Metha & Gregory, Appl. Environ. Microbiol. 41, 992-999 (1981)). Useful strains of C. albicans include those with mutations in chs1, chs2, or chs3 (encoding chitin synthetases), such as osmotic remedial conditional lethal mutants described by Payton & de Tiani, Curr. Genet. 17, 293-296 (1990); C. utilis mutants with increased sensitivity to cell-wall digesting enzymes isolated from snail gut (Metha & Gregory, 1981, supra); and N. crassa mutants os-1, os-2, os-3, os-4, os-5, and os-6. See Selitrennikoff, Antimicrob. Agents Chemother. 23, 757-765 (1983). Such mutants grow and divide without a cell wall at 37° C, but produce a cell wall at 22° C.

Targeted mutagenesis can be achieved by transforming cells with a positive-negative selection vector containing homologous regions flanking a segment to be targeted, a positive selection marker between the homologous regions and a negative selection marker outside the homologous regions (see Capecchi, U.S. Pat. No. 5,627,059). In a variation, the negative selection marker can be an antisense transcript of the positive selection marker (see U.S. Pat. No. 5,527,674).
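
The logic of positive-negative selection can be made explicit with a small sketch. It assumes the usual simplification that homologous recombination transfers only the sequence between the homologous regions (and so the positive marker alone), whereas random integration carries the whole vector, including the negative marker. Real integration events are less clean, and the class and function names are illustrative only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Integrant:
    has_positive_marker: bool
    has_negative_marker: bool

def integrant_from_event(event: str) -> Integrant:
    """Markers carried by a cell after a given (idealized) integration event."""
    if event == "homologous":
        # Only the segment between the homologous regions is inserted.
        return Integrant(has_positive_marker=True, has_negative_marker=False)
    if event == "random":
        # Assumed: the whole vector integrates, negative marker included.
        return Integrant(has_positive_marker=True, has_negative_marker=True)
    if event == "none":
        return Integrant(has_positive_marker=False, has_negative_marker=False)
    raise ValueError(f"unknown event: {event}")

def survives_selection(cell: Integrant) -> bool:
    """Survive positive selection and escape negative selection."""
    return cell.has_positive_marker and not cell.has_negative_marker

for event in ("homologous", "random", "none"):
    print(event, survives_selection(integrant_from_event(event)))
```

Under these assumptions only the homologous (targeted) integrants pass both selections, which is what enriches for the desired mutants.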

Other suitable cells can be selected by random mutagenesis or stochastic &/or non-stochastic mutagenesis procedures in combination with selection. For example, a first subpopulation of cells is mutagenized, allowed to recover from mutagenesis, subjected to incomplete degradation of cell walls and then contacted with protoplasts of a second subpopulation of cells. Hybrid cells bearing markers from both subpopulations are identified (as described above) and used as the starting materials in a subsequent round of stochastic &/or non-stochastic mutagenesis. This selection scheme selects both for cells with capacity for spontaneous protoplast formation and for cells with enhanced recombinogenicity.

In a further variation, cells having capacity for spontaneous protoplast formation can be crossed with cells having enhanced recombinogenicity evolved using other methods of the invention. The hybrid cells are particularly suitable hosts for whole genome stochastic &/or non-stochastic mutagenesis.

Cells with mutations in enzymes involved in cell wall synthesis or maintenance can undergo fusion simply as a result of propagating the cells in osmotic-protected culture, due to spontaneous protoplast formation. If the mutation is conditional, cells are shifted to a nonpermissive condition. Protoplast formation and fusion can be accelerated by addition of promoting agents, such as PEG or an electric field (see Philipova & Venkov, Yeast 6, 205-212 (1990); Tsoneva et al., FEMS Microbiol. Lett. 51, 61-65 (1989)).

4.10 Process of Sexual Reproduction

Sexual reproduction provides a mechanism for stochastic &/or non-stochastic mutagenesis of genetic material between cells. A sexual reproductive cycle is characterized by an alternation of a haploid phase and a diploid phase. Diploidy occurs when two haploid gamete nuclei fuse (karyogamy). The gamete nuclei can come from the same parental strain (self-fertile), as in the homothallic fungi. In heterothallic fungi, the gamete nuclei come from parental strains of different mating type.

A diploid cell converts to haploidy via meiosis, which essentially consists of two divisions of the nucleus accompanied by one division of the chromosomes. The products of one meiosis are a tetrad (4 haploid nuclei). In some cases, a mitotic division occurs after meiosis, giving rise to eight product cells. The arrangement of the resultant cells (usually enclosed in spores) resembles that of the parental strains. The length of the haploid and diploid stages differs in various fungi: for example, the Basidiomycetes and many of the Ascomycetes have a mostly haploid life cycle (that is, meiosis occurs immediately after karyogamy), whereas others (e.g., Saccharomyces cerevisiae) are diploid for most of their life cycle (karyogamy occurs soon after meiosis). Sexual reproduction can occur between cells in the same strain (selfing) or between cells from different strains (outcrossing). Sexual dimorphism (dioecism) is the separate production of male and female organs on different mycelia. This is a rare phenomenon among the fungi, although a few examples are known. Heterothallism (one locus-two alleles) allows for outcrossing between cross-compatible strains which are self-incompatible. The simplest form is the two allele-one locus system of mating types/factors, illustrated by the following organisms: A and a in Neurospora; a and α in Saccharomyces; plus and minus in Schizosaccharomyces and Zygomycetes; ₁ and ₂ in Ustilago.

Multiple-allelomorph heterothallism is exhibited by some of the higher Basidiomycetes (e.g., Gasteromycetes and Hymenomycetes), which are heterothallic and have several mating types determined by multiple alleles. Heterothallism in these organisms is either bipolar with one mating type factor, or tetrapolar with two unlinked factors, A and B. Stable, fertile heterokaryon formation depends on the presence of different A factors and, in the case of tetrapolar organisms, of different B factors as well. This system is effective in the promotion of outbreeding and the prevention of self-breeding. The number of different mating factors may be very large (i.e., thousands) (Kothe, FEMS Microbiol. Rev. 18, 65-87 (1996)), and non-parental mating factors may arise by recombination.
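
The mating rule just described (different A factors are required, and in tetrapolar organisms different B factors as well) reduces to a short predicate. The sketch below is illustrative only: the type name, the allele labels and the convention of using None for the absent B factor in bipolar species are assumptions made for the example.

```python
from typing import NamedTuple, Optional

class MatingType(NamedTuple):
    a_factor: str
    b_factor: Optional[str] = None  # None models a bipolar species (no B factor)

def can_form_fertile_heterokaryon(x: MatingType, y: MatingType) -> bool:
    """Apply the rule: A factors must differ; if both have B factors, those must differ too."""
    if x.a_factor == y.a_factor:
        return False
    if x.b_factor is not None and y.b_factor is not None:
        return x.b_factor != y.b_factor
    return True

# Hypothetical tetrapolar strains.
s1 = MatingType("A1", "B1")
s2 = MatingType("A2", "B2")
s3 = MatingType("A2", "B1")

print(can_form_fertile_heterokaryon(s1, s1))  # a strain with itself: self-breeding is blocked
print(can_form_fertile_heterokaryon(s1, s2))  # both factors differ
print(can_form_fertile_heterokaryon(s1, s3))  # shared B factor
```

Because a strain always shares both factors with itself, the rule excludes selfing while, with many alleles per factor, leaving most pairs of unrelated strains compatible.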

4.10.1 Introducing Sexual Cycles

4.10.2 Meiosis

4.10.2.1 Heterokaryon-A Cell or Hypha Containing Two or More Nuclei of Different Genetic Constitutions

One desired property is the introduction of meiotic apparatus into fungi presently lacking a sexual cycle (see Sharon et al., Mol. Gen. Genet. 251, 60-68 (1996)). A scheme for introducing a sexual cycle into the fungus P. chrysogenum (a fungus imperfecti) is shown herein. Subpopulations of protoplasts are formed from A. nidulans (which has a sexual cycle) and P. chrysogenum, which does not. The two strains preferably bear different markers. The A. nidulans protoplasts are killed by treatment with UV or hydroxylamine. The two subpopulations are fused to form heterokaryons. In some heterokaryons, nuclei fuse, and some recombination occurs. Fused cells are cultured under conditions to generate new cell walls and then to allow sexual recombination to occur. Cells with recombinant genomes are then selected (e.g., by selecting for complementation of auxotrophic markers present on the respective parent strains). Cells with hybrid genomes are more likely to have acquired the genes necessary for a sexual cycle. Protoplasts of cells can then be crossed with killed protoplasts of a further population of cells known to have a sexual cycle (the same as or different from that of the previous round) in the same manner, followed by selection for cells with hybrid genomes.

4.10.2.2 Vegetative Compatibility Between Classes of Fungi

Within the above four classes, fungi are also classified by vegetative compatibility group. Fungi within a vegetative compatibility group can form heterokaryons with each other. Thus, for exchange of genetic material between different strains of fungi, the fungi are usually prepared from the same vegetative compatibility group. However, some genetic exchange can occur between fungi from different incompatibility groups as a result of parasexual reproduction (see Timberlake et al., U.S. Pat. No. 5,605,820). Further, as discussed elsewhere, the natural vegetative compatibility group of fungi can be expanded as a result of stochastic &/or non-stochastic mutagenesis. Several isolates of Aspergillus nidulans, A. flavus, A. fumigatus, Penicillium chrysogenum, P. notatum, Cephalosporium chrysogenum, Neurospora crassa, and Aureobasidium pullulans have been karyotyped. Genome sizes generally range between 20 and 50 Mb among the Aspergilli. Differences in karyotypes often exist between similar strains and are also caused by transformation with exogenous DNA. Filamentous fungal genes contain introns, usually 50-100 bp in size, with similar consensus 5′ and 3′ splice sequences. Promotion and termination signals are often cross-recognizable, enabling the expression of a gene/pathway from one fungus (e.g., A. nidulans) in another (e.g., P. chrysogenum). The major components of the fungal cell wall are chitin (or chitosan), beta-glucan, and mannoproteins. Chitin and beta-glucan form the scaffolding; mannoproteins are interstitial components which dictate the wall's porosity, antigenicity and adhesion. Chitin synthetase catalyzes the polymerization of beta-(1,4)-linked N-acetylglucosamine (GlcNAc) residues, forming linear strands running antiparallel; beta-(1,3)-glucan synthetase catalyzes the homopolymerization of glucose.

4.11 Evolution

4.11.1 Artificially Evolving Cells to Acquire a New or Improved Property by Stochastic &/or Non-Stochastic Mutagenesis

The invention provides a number of strategies for evolving metabolic and bioprocessing pathways through the technique of recursive sequence recombination. One strategy entails evolving genes that confer the ability to use a particular substrate of interest as a nutrient source in one species to confer either more efficient use of that substrate in that species, or comparable or more efficient use of that substrate in a second species. Another strategy entails evolving genes that confer the ability to detoxify a compound of interest in one or more species of organisms. Another strategy entails evolving new metabolic pathways by evolving an enzyme or metabolic pathway for biosynthesis or degradation of a compound A related to a compound B for the ability to biosynthesize or degrade compound B, either in the host of origin or a new host. A further strategy entails evolving a gene or metabolic pathway for more efficient or optimized expression of a particular metabolite or gene product. A further strategy entails evolving a host/vector system for expression of a desired heterologous product. These strategies may involve using all the genes in a multi-step pathway, one or several genes, genes from different organisms, or one or more fragments of a gene.

The strategies generally entail evolution of gene(s) or segment(s) thereof to allow retention of function in a heterologous cell or improvement of function in a homologous or heterologous cell. Evolution is effected generally by a process termed recursive sequence recombination. Recursive sequence recombination can be achieved in many different formats and permutations of formats, as described in further detail below. These formats share some common principles. Recursive sequence recombination entails successive cycles of recombination to generate molecular diversity, i.e., the creation of a family of nucleic acid molecules showing substantial sequence identity to each other but differing in the presence of mutations. Each recombination cycle is followed by at least one cycle of screening or selection for molecules having a desired characteristic. The molecule(s) selected in one round form the starting materials for generating diversity in the next round. In any given cycle, recombination can occur in vivo or in vitro. Furthermore, diversity resulting from recombination can be augmented in any cycle by applying prior methods of mutagenesis (e.g., error-prone PCR or cassette mutagenesis, passage through bacterial mutator strains, treatment with chemical mutagens) to either the substrates for or products of recombination.
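
The cycle structure described above (recombine, screen or select, and use the selected molecules to seed the next round) can be sketched as a toy simulation. Everything concrete here is an assumption made for illustration: sequences are modeled as strings of '0' and '1' in which '1' marks a hypothetical beneficial mutation, recombination is a single crossover, the screen simply counts beneficial positions, and parents are retained in the pool so that the best score never falls.

```python
import random
from typing import List

def recombine(parent_a: str, parent_b: str, rng: random.Random) -> str:
    """Single-crossover recombinant: prefix of parent_a joined to suffix of parent_b."""
    point = rng.randrange(1, len(parent_a))
    return parent_a[:point] + parent_b[point:]

def score(seq: str) -> int:
    """Toy screen: number of positions carrying the assumed beneficial allele '1'."""
    return seq.count("1")

def recursive_recombination(population: List[str], cycles: int,
                            library_size: int, keep: int, seed: int = 0) -> List[str]:
    """Run successive cycles of recombination, each followed by selection."""
    rng = random.Random(seed)
    selected = list(population)
    for _ in range(cycles):
        library = []
        for _ in range(library_size):
            parent_a, parent_b = rng.sample(selected, 2)
            library.append(recombine(parent_a, parent_b, rng))
        # Selection step: the best `keep` molecules seed the next cycle.
        pool = selected + library
        selected = sorted(pool, key=score, reverse=True)[:keep]
    return selected

# Four hypothetical parents, each carrying one different beneficial mutation.
parents = ["1000", "0100", "0010", "0001"]
evolved = recursive_recombination(parents, cycles=5, library_size=20, keep=4, seed=1)
print(evolved, [score(s) for s in evolved])
```

The additional mutagenesis mentioned in the text (e.g., error-prone PCR) could be modeled by flipping random positions in the library before the selection step; it is omitted here to keep the sketch short.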

4.11.2 Basic Approach

4.11.2.1 Successive Cycles of Recombination and Screening/Selection

The invention provides methods for artificially evolving cells to acquire a new or improved property by recursive sequence recombination. Briefly, recursive sequence recombination entails successive cycles of recombination to generate molecular diversity and screening/selection to take advantage of that molecular diversity. That is, a family of nucleic acid molecules is created showing substantial sequence and/or structural identity but differing as to the presence of mutations. These sequences are then recombined in any of the described formats so as to optimize the diversity of mutant combinations represented in the resulting recombined library. Typically, any resulting recombinant nucleic acids or genomes are recursively recombined for one or more cycles of recombination to increase the diversity of resulting products. After this recursive recombination procedure, the final resulting products are screened and/or selected for a desired trait or property.

Alternatively, each recombination cycle can be followed by at least one cycle of screening or selection for molecules having a desired characteristic. In this embodiment, the molecule(s) selected in one round form the starting materials for generating diversity in the next round.

The cells to be evolved can be bacteria, archaebacteria, or eukaryotic cells and can constitute a homogeneous cell line or mixed culture. Suitable cells for evolution include the bacterial and eukaryotic cell lines commonly used in genetic engineering, protein expression, or the industrial production or conversion of proteins, enzymes, primary metabolites, secondary metabolites, fine, specialty or commodity chemicals. Suitable mammalian cells include those from, e.g., mouse, rat, hamster, primate, and human, both cell lines and primary cultures. Such cells include stem cells, including embryonic stem cells and hemopoietic stem cells, zygotes, fibroblasts, lymphocytes, Chinese hamster ovary (CHO), mouse fibroblasts (NIH3T3), kidney, liver, muscle, and skin cells. Other eukaryotic cells of interest include plant cells, such as maize, rice, wheat, cotton, soybean, sugarcane, tobacco, and Arabidopsis; fish, algae, fungi (Penicillium, Aspergillus, Podospora, Neurospora, Saccharomyces), insect (e.g., baculo lepidoptera), yeast (Pichia and Saccharomyces, Schizosaccharomyces pombe). Also of interest are many bacterial cell types, both gram-negative and gram-positive, such as Bacillus subtilis, B. licheniformis, B. cereus, Escherichia coli, Streptomyces, Pseudomonas, Salmonella, Actinomycetes, Lactobacillus, Acelonitcbacter, Deinococcus, and Erwinia. The complete genome sequences of E. coli and Bacillus subtilis are described by Blattner et al., Science 277, 1454-1462 (1997); Kunst et al., Nature 390, 249-256 (1997).

4.11.2.1.1 Goal is to Achieve Variation

Evolution commences by generating a population of variant cells. Typically, the cells in the population are of the same type but represent variants of a progenitor cell. In some instances, the variation is natural, as when different cells are obtained from different individuals within a species, from different species or from different genera. In other instances, variation is induced by mutagenesis of a progenitor cell. Mutagenesis can be effected by subjecting the cell to mutagenic agents, or, if the cell is a mutator cell (e.g., has mutations in genes involved in DNA replication, recombination and/or repair which favor introduction of mutations), simply by propagating the mutator cells. Mutator cells can be generated from successive selections for simple phenotypic changes (e.g., acquisition of rifampicin resistance, then nalidixic acid resistance, then lac− to lac+ (see Mao et al., J. Bacteriol. 179, 417-422 (1997))), or mutator cells can be generated by exposure to specific inhibitors of cellular factors that result in the mutator phenotype. These could be inhibitors of mutS, mutL, mutD, recD, mutY, mutM, dam, uvrD and the like.

More generally, mutations are induced in cell populations using any available mutation technique. Common mechanisms for inducing mutations include, but are not limited to, the use of strains comprising mutations such as those involved in mismatch repair, e.g., mutations in mutS, mutT, mutL and mutH; exposure to UV light; chemical mutagenesis, e.g., use of inhibitors of MMR, DNA damage inducible genes, or SOS inducers; overproduction/underproduction/mutation of any component of the homologous recombination complex/pathway, e.g., RecA, ssb, etc.; overproduction/underproduction/mutation of genes involved in DNA synthesis/homeostasis; overproduction/underproduction/mutation of recombination-stimulating genes from bacteria, phage (e.g., Lambda Red function), or other organisms; addition of chi sites into/flanking the donor DNA fragments; coating the DNA fragments with RecA/ssb and the like.

In other instances, variation is the result of transferring a library ofDNA fragments into the cells (e.g., by conjugation, protoplast fusion,liposome fusion, transformation, transduction or natural competence). Atleast one, and usually many of the fragments in the library, show some,but not complete, sequence or structural identity with a cognate orallelic gene within the cells sufficient to allow homologousrecombination to occur.

For example, in one embodiment, homologous integration of a plasmid carrying a stochastic &/or non-stochastic mutagenized gene or metabolic pathway leads to insertion of the plasmid-borne sequences adjacent to the genomic copy. Optionally, a counter-selectable marker strategy is used to select for recombinants in which recombination occurred between the homologous sequences, leading to elimination of the counter-selectable marker. A variety of selectable and counter-selectable markers are amply illustrated in the art. For a list of useful markers, see Berg and Berg (1996), Transposable element tools for microbial genetics, Escherichia coli and Salmonella, Neidhardt. Washington, D.C., ASM Press. 2: 2588-2612; LaRossa, ibid., 2527-2587. This strategy can be recursively repeated to maximize sequence diversity of targeted genes prior to screening/selection for a desired trait or property.

The library of fragments can derive from one or more sources. One sourceof fragments is a genomic library of fragments from a different species,cell type, organism or individual from the cells being transfected. Inthis situation, many of the fragments in the library have a cognate orallelic gene in the cells being transformed but differ from that genedue to the presence of naturally occurring species variation,polymorphisms, mutations, and the presence of multiple copies of somehomologous genes in the genome. Alternatively, the library can bederived from DNA from the same cell type as is being transformed afterthat DNA has been subject to induced mutation, by conventional methods,such as radiation, error-prone PCR, growth in a mutator organism,transposon mutagenesis, or cassette mutagenesis.

Alternatively, the library can derive from a genomic library of fragments generated from the pooled genomic DNA of a population of cells having the desired characteristics.

In any of these situations, the genomic library can be a complete genomic library or a subgenomic library deriving, for example, from a selected chromosome, or part of a chromosome, or an episomal element within a cell. As well as, or instead of, these sources of DNA fragments, the library can contain fragments representing natural or selected variants of selected genes of known function (i.e., focused libraries).

The number of fragments in a library can vary from a single fragment to about 10¹⁰, with libraries having from 10³ to 10⁸ fragments being common. The fragments should be sufficiently long that they can undergo homologous recombination and sufficiently short that they can be introduced into a cell, and if necessary, manipulated before introduction. Fragment sizes can range from about 10 b to about 20 Mb. Fragments can be double- or single-stranded. The fragments can be introduced into cells as whole genomes or as components of viruses, plasmids, YACs, HACs or BACs, or can be introduced as they are, in which case all or most of the fragments lack an origin of replication. Use of viral fragments with single-stranded genomes offers the advantage of delivering fragments in single-stranded form, which promotes recombination. The fragments can also be joined to a selective marker before introduction. Inclusion of fragments in a vector having an origin of replication affords a longer period of time after introduction into the cell in which fragments can undergo recombination with a cognate gene before being degraded or selected against and lost from the cell, thereby increasing the proportion of cells with recombinant genomes.

Optionally, the vector is a suicide vector capable of a longer existence than an isolated DNA fragment but not capable of permanent retention in the cell line. Such a vector can transiently express a marker for a sufficient time to screen for or select a cell bearing the vector (e.g., because cells transduced by the vector are the target cell type to be screened in subsequent selection assays), but is then degraded or otherwise rendered incapable of expressing the marker. The use of such vectors can be advantageous in performing optional subsequent rounds of recombination to be discussed below. For example, some suicide vectors express a long-lived toxin which is neutralized by a short-lived molecule expressed from the same vector. Expression of the toxin alone will not allow the vector to be established. Jensen & Gerdes, Mol. Microbiol. 17, 205-210 (1995); Bernard et al., Gene 162, 159-160. Alternatively, a vector can be rendered suicidal by incorporation of a defective origin of replication (e.g., a temperature-sensitive origin of replication) or by omission of an origin of replication. Vectors can also be rendered suicidal by inclusion of negative selection markers, such as ura3 in yeast or sacB in many bacteria.

These genes become toxic only in the presence of specific compounds. Such vectors can be selected to have a wide range of stabilities. A list of conditional replication defects for vectors which can be used, e.g., to render the vector replication defective is found, e.g., in Berg and Berg (1996), “Transposable element tools for microbial genetics” Escherichia coli and Salmonella Neidhardt. Washington, D.C., ASM Press. 2: 2588-2612. Similarly, a list of counter-selectable markers generally applicable to vector selection is also found in Berg and Berg, id. See also LaRossa (1996), “Mutant selections linking physiology, inhibitors, and genotypes” Escherichia coli and Salmonella F. C. Neidhardt. Washington, D.C., ASM Press. 2: 2527-2587.

After introduction into cells, the fragments can recombine with DNA present in the genome, or episomes of the cells, by homologous, nonhomologous or site-specific recombination. For present purposes, homologous recombination makes the most significant contribution to evolution of the cells because this form of recombination amplifies the existing diversity between the DNA of the cells being transfected and the DNA fragments. For example, if a DNA fragment being transfected differs from a cognate or allelic gene at two positions, there are four possible recombination products, and each of these recombination products can be formed in different cells in the transformed population. Thus, homologous recombination of the fragment doubles the initial diversity in this gene. When many fragments recombine with corresponding cognate or allelic genes, the diversity of recombination products with respect to starting products increases exponentially with the number of mutations. Recombination results in modified cells having modified genomes and/or episomes. Recursive recombination prior to selection further increases diversity of resulting modified cells.
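The two-position example above can be illustrated with a minimal Python sketch. The sequences and the function name `recombination_products` are hypothetical and chosen only for illustration, and the sketch assumes an idealized case in which a crossover can fall between any two differing positions, so that each difference is inherited independently.

```python
from itertools import product


def recombination_products(genomic: str, fragment: str) -> set:
    """Return every sequence that mixes the two parents at their differing positions.

    Assumption for illustration: crossovers can fall between any two
    differing positions, so each difference is inherited independently.
    """
    if len(genomic) != len(fragment):
        raise ValueError("sequences must be aligned and of equal length")
    # Identical positions offer one choice; differing positions offer two.
    choices = [(g,) if g == f else (g, f) for g, f in zip(genomic, fragment)]
    return {"".join(combo) for combo in product(*choices)}


genomic_allele = "ACGTACGT"    # hypothetical cognate gene in the recipient cell
library_fragment = "ACCTACTT"  # hypothetical fragment differing at two positions
products = recombination_products(genomic_allele, library_fragment)
print(sorted(products))
print(len(products))
```

Under this assumption, n differing positions give 2ⁿ possible products, which is the sense in which diversity grows exponentially with the number of mutations.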

The variant cells, whether the result of natural variation, mutagenesis,or recombination are screened or selected to identify a subset of cellsthat have evolved toward acquisition of a new or improved property. Thenature of the screen, of course, depends on the property and severalexamples will be discussed below. Typically, recombination is repeatedbefore initial screening. Optionally, however, the screening can also berepeated before performing subsequent cycles of recombination.Stringency can be increased in repeated cycles of screening. Thesubpopulation of cells surviving screening are optionally subjected to afurther round of recombination. In some instances, the further round ofrecombination is effected by propagating the cells under conditionsallowing exchange of DNA between cells. For example, protoplasts can beformed from the cells, allowed to fuse, and regenerated. Cells withrecombinant genomes are propagated from the fused protoplasts.Alternatively, exchange of DNA can be promoted by propagation of cellsor protoplasts in an electric field. For cells having a conjugativetransfer apparatus, exchange of DNA can be promoted simply bypropagating the cells.

4.11.2.1.2 Using Two Separate Pools: Amplify First Pool and Add to Second Pool

In other methods, the further round of recombination is performed by asplit and pool approach. That is, the surviving cells are divided intotwo pools. DNA is isolated from one pool, and if necessary amplified,and then transformed into the other pool.

Accordingly, DNA fragments from the first pool constitute a furtherlibrary of fragments and recombine with cognate fragments in the secondpool resulting in further diversity. As shown, a pool of mutant bacteriawith improvements in a desired phenotype is obtained and split. Genesare obtained from one half, e.g., by PCR, by cloning of random genomicfragments, by infection with a transducing phage and harvestingtransducing particles, or by the introduction of an origin of transfer(OriT) randomly into the relevant chromosome to create a donorpopulation of cells capable of transferring random fragments byconjugation to an acceptor population. These genes are then stochastic&/or non-stochastic mutagenized (in vitro by known methods or in vivo astaught herein), or simply cloned into an allele replacement vector(e.g., one carrying selectable and counter-selectable markers). The genepool is then transformed into the other half of the original mutant pooland recombinants are selected and screened for further improvements inphenotype. These best variants are used as the starting point for thenext cycle. Alternatively, recursive recombination by any of the methodsnoted can be performed prior to screening, thereby increasing thediversity of the population of cells to be screened.

4.11.2.1.3 Surviving Cells are Transfected with Fresh DNA

In other methods, some or all of the cells surviving screening aretransfected with a fresh library of DNA fragments, which can be the sameor different from the library used in the first round of recombination.In this situation, the genes in the fresh library undergo recombinationwith cognate genes in the surviving cells. If genes are introduced ascomponents of a vector, compatibility of this vector with any vectorused in a previous round of transfection should be considered. If thevector used in a previous round was a suicide vector, there is noproblem of incompatibility. If, however, the vector used in a previousround was not a suicide vector, a vector having a differentincompatibility origin should be used in the subsequent round. In all ofthese formats, further recombination generates additional diversity inthe DNA component of the cells resulting in further modified cells.

The further modified cells are subjected to another round ofscreening/selection according to the same principles as the first round.Screening/selection identifies a subpopulation of further modified cellsthat have further evolved toward acquisition of the property. Thissubpopulation of cells can be subjected to further rounds ofrecombination and screening according to the same principles, optionallywith the stringency of screening being increased at each round.Eventually, cells are identified that have acquired the desiredproperty.
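The cycle of recombination followed by screening at increasing stringency can be sketched as a toy simulation. Everything in it is an assumption made for illustration: genomes are modeled as lists of 0/1 alleles, the hypothetical screen `score` simply counts favorable alleles, recombination is modeled as a random per-position choice between two surviving parents, and the parameters `keep_fraction` and `tighten` are arbitrary.

```python
import random


def score(genome):
    """Hypothetical screen: the number of favorable (1) alleles."""
    return sum(genome)


def recombine(parent_a, parent_b, rng):
    """Toy recombination: each position is taken from one parent at random."""
    return [rng.choice(pair) for pair in zip(parent_a, parent_b)]


def evolve(population, rounds, rng, keep_fraction=0.5, tighten=0.8):
    """Alternate screening and recombination, tightening the screen each round."""
    size = len(population)
    for _ in range(rounds):
        # Screening: keep only the best-scoring subpopulation.
        ranked = sorted(population, key=score, reverse=True)
        n_keep = max(2, int(size * keep_fraction))
        survivors = ranked[:n_keep]
        # Recombination: survivors exchange alleles to refill the population.
        offspring = [
            recombine(rng.choice(survivors), rng.choice(survivors), rng)
            for _ in range(size - n_keep)
        ]
        population = survivors + offspring
        keep_fraction *= tighten  # stringency increases at each round
    return population


rng = random.Random(0)
starting = [[rng.randint(0, 1) for _ in range(20)] for _ in range(40)]
evolved = evolve(starting, rounds=5, rng=rng)
print(max(map(score, starting)), max(map(score, evolved)))
```

Because survivors are carried forward unchanged in this sketch, the best score in the population cannot decrease from one round to the next.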

4.11.3 Variations

4.11.3.1 Coating with RecA to Enrich Diversity of Homologous Recombination

The frequency of homologous recombination between library fragments and cognate endogenous genes can be increased by coating the fragments with a recombinogenic protein before introduction into cells. See Pati et al., Molecular Biology of Cancer 1, 1 (1996); Sena & Zarling, Nature Genetics 3, 365 (1996); Revet et al., J. Mol. Biol. 232, 779-791 (1993); Kowalczykowski & Zarling in Gene Targeting (CRC 1995), Ch. 7. The recombinogenic protein promotes homologous pairing and/or strand exchange. The best characterized recA protein is from E. coli and is available from Pharmacia (Piscataway, N.J.).

In addition to the wild-type protein, a number of mutant recA-like proteins have been identified (e.g., recA803). Further, many organisms have recA-like recombinases with strand-transfer activities (e.g., Ogawa et al., Cold Spring Harbor Symposium on Quantitative Biology 18, 567-576 (1993); Johnson & Symington, Mol. Cell. Biol. 15, 4843-4850 (1995); Fujisawa et al., Nucl. Acids Res. 13, 7473 (1985); Hsieh et al., Cell 44, 885 (1986); Hsieh et al., J. Biol. Chem. 264, 5089 (1989); Fishel et al., Proc. Natl. Acad. Sci. USA 85, 3683 (1988); Cassuto et al., Mol. Gen. Genet. 208, 10 (1987); Ganea et al., Mol. Cell Biol. 7, 3124 (1987); Moore et al., J. Biol. Chem. 19, 11108 (1990); Keene et al., Nucl. Acids Res. 12, 3057 (1984); Kmiec, Cold Spring Harbor Symp. 48, 675 (1984); Kmiec, Cell 44, 545 (1986); Kolodner et al., Proc. Natl. Acad. Sci. USA 84, 5560 (1987); Sugino et al., Proc. Natl. Acad. Sci. USA 85, 3683 (1985); Halbrook et al., J. Biol. Chem. 264, 21403 (1989); Eisen et al., Proc. Natl. Acad. Sci. USA 85, 7481 (1988); McCarthy et al., Proc. Natl. Acad. Sci. USA 85, 5854 (1988); Lowenhaupt et al., J. Biol. Chem. 264, 20568 (1989)). Examples of such recombinase proteins include recA, recA803, uvsX (Roca, A. I., Crit. Rev. Biochem. Molec. Biol. 25, 415 (1990)), sepI (Kolodner et al., Proc. Natl. Acad. Sci. (U.S.A.) 84, 5560 (1987); Tishkoff et al., Molec. Cell. Biol. 11, 2593), RuvC (Dunderdale et al., Nature 354, 506 (1991)), DS72, KEMI, XRATI (Dykstra et al., Molec. Cell. Biol. 11, 2583 (1991)), STP/DSTI (Clark et al., Molec. Cell. Biol. 11, 2576 (1991)), HPP-I (Moore et al., Proc. Natl. Acad. Sci. (U.S.A.) 88, 9067 (1991)), and other eukaryotic recombinases (Bishop et al., Cell 69, 439 (1992); Shinohara et al., Cell 69, 457).

RecA protein forms a nucleoprotein filament when it coats a single-stranded DNA. In this nucleoprotein filament, one monomer of recA protein is bound to about 3 nucleotides. This property of recA to coat single-stranded DNA is essentially sequence independent, although particular sequences favor initial loading of recA onto a polynucleotide (e.g., nucleation sequences). The nucleoprotein filament(s) can be formed on essentially any DNA to be stochastic &/or non-stochastic mutagenized and can form complexes with both single-stranded and double-stranded DNA in prokaryotic and eukaryotic cells.

Before contacting with recA or other recombinase, fragments are often denatured, e.g., by heat-treatment. RecA protein is then added at a concentration of about 1-10 µM. After incubation, the recA-coated single-stranded DNA is introduced into recipient cells by conventional methods, such as chemical transformation or electroporation. In general, it can be desirable to coat the DNA with a RecA homolog isolated from the organism into which the coated DNA is being delivered. Recombination involves several cellular factors and the host RecA equivalent generally interacts better with other host factors than less closely related RecA molecules. The fragments undergo homologous recombination with cognate endogenous genes. Because of the increased frequency of recombination due to recombinase coating, the fragments need not be introduced as components of vectors. Fragments are sometimes coated with other nucleic acid binding proteins that promote recombination, protect nucleic acids from degradation, or target nucleic acids to the nucleus. Examples of such proteins include Agrobacterium virE2 (Durrenberger et al., Proc. Natl. Acad. Sci. USA 86, 9154-9158 (1989)). Alternatively, the recipient strains are deficient in RecD activity. Single-stranded ends can also be generated by 3′-5′ exonuclease activity or restriction enzymes producing 5′ overhangs.

4.11.3.2 Affinity Chromatography with MutS to Enrich for Fragments Having at Least One Mismatch

The E. coli mismatch repair protein MutS can be used in affinity chromatography to enrich for fragments of double-stranded DNA containing at least one base mismatch. The MutS protein recognizes the bubble formed by the individual strands about the point of the mismatch. See, e.g., Hsu & Chang, WO 9320233. The strategy of affinity enriching for partially mismatched duplexes can be incorporated into the present methods to increase the diversity between an incoming library of fragments and corresponding cognate or allelic genes in recipient cells.

MutS is used to increase diversity. The DNA substrates for enrichmentare substantially similar to each other but differ at a few sites.

For example, the DNA substrates can represent complete or partialgenomes (e.g., a chromosome library) from different individuals with thedifferences being due to polymorphisms. The substrates can alsorepresent induced mutants of a wild type sequence.

The DNA substrates are pooled, restriction digested, and denatured toproduce fragments of single-stranded DNA. The single-stranded DNA isthen allowed to reanneal. Some single-stranded fragments reanneal with aperfectly matched complementary strand to generate perfectly matchedduplexes. Other single-stranded fragments anneal to generate mismatchedduplexes. The mismatched duplexes are enriched from perfectly matchedduplexes by MutS chromatography (e.g., with MutS immobilized to beads).The mismatched duplexes recovered by chromatography are introduced intorecipient cells for recombination with cognate endogenous genes asdescribed above. MutS affinity chromatography increases the proportionof fragments differing from each other and the cognate endogenous gene.Thus, recombination between the incoming fragments and endogenous genesresults in greater diversity.
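The denature, reanneal and enrich steps above can be sketched in Python under simplifying assumptions. The variant sequences and the function names are hypothetical. Each strand is represented by the top-strand sequence of the variant it came from, so a position where the two entries of a duplex differ stands for a mismatch, and reannealing is modeled by listing every possible pairing rather than by random sampling.

```python
from itertools import product


def mismatch_count(strand_a, strand_b):
    """Count positions at which the two strands of a duplex do not match."""
    return sum(1 for x, y in zip(strand_a, strand_b) if x != y)


def reanneal(variants):
    """Pair a top strand from any variant with a bottom strand from any variant."""
    return list(product(variants, repeat=2))


def muts_enrich(duplexes):
    """Retain only duplexes with at least one mismatch, as MutS binding would."""
    return [d for d in duplexes if mismatch_count(*d) >= 1]


variants = ["ACGTAC", "ACGTTC", "ACCTAC"]  # hypothetical pooled substrates
duplexes = reanneal(variants)
enriched = muts_enrich(duplexes)
print(len(duplexes), len(enriched))
```

In this sketch the three perfectly matched duplexes (each variant reannealed to itself) are discarded and the six mismatched duplexes are kept.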

A second strategy for MutS enrichment is as follows. In this strategy, the substrates for MutS enrichment represent variants of a relatively short segment, for example, a gene or cluster of genes, in which most of the different variants differ at no more than a single nucleotide. The goal of MutS enrichment is to produce substrates for recombination that contain more variations than sequences occurring in nature. This is achieved by fragmenting the substrates at random to produce overlapping fragments. The fragments are denatured and reannealed as in the first strategy. Reannealing generates some mismatched duplexes which can be separated from perfectly matched duplexes by MutS affinity chromatography. As before, MutS chromatography enriches for duplexes bearing at least a single mismatch. The mismatched duplexes are then stochastic &/or non-stochastic mutagenized into longer fragments. This is accomplished by cycles of denaturation, reannealing, and chain extension of partially annealed duplexes. After several such cycles, fragments of the same length as the original substrates are achieved, except that these fragments differ from each other at multiple sites. These fragments are then introduced into cells where they undergo recombination with cognate endogenous genes.

4.11.3.3 Suicide Vector Enriches Mutations for Cells that have Integrated the Vector into the Host Chromosome

The invention further provides methods of enriching for cells bearing modified genes relative to the starting cells. This can be achieved by introducing a DNA fragment library (e.g., a single specific segment or a whole or partial genomic library) in a suicide vector (i.e., lacking a functional replication origin in the recipient cell type) containing both positive and negative selection markers. Optionally, multiple fragment libraries from different sources (e.g., B. subtilis, B. licheniformis and B. cereus) can be cloned into different vectors bearing different selection markers. Suitable positive selection markers include neo^(R), kanamycin^(R), hyg, hisD, gpt, ble, tet^(R). Suitable negative selection markers include hsv-tk, hprt, gpt, SacB, ura3 and cytosine deaminase. A variety of examples of conditional replication vectors, mutations affecting vector replication, limited host range vectors, and counterselectable markers are found in Berg and Berg, supra, and LaRossa, ibid. and the references therein.

In one example, a plasmid with R6K and f1 origins of replication, a positively selectable marker (beta-lactamase), and a counterselectable marker (B. subtilis sacB) was used. Plasmids containing cloned genes were delivered by M13 transduction and were efficiently recombined into the chromosomal copy of that gene in a rep mutant E. coli strain.

Another strategy for applying negative selection is to include a wildtype rpsL gene (encoding ribosomal protein S12) in a vector for use incells having a mutant rpsL gene conferring streptomycin resistance. Themutant form of rpsL is recessive in cells having wild type rpsL. Thus,selection for Sm resistance selects against cells having a wild typecopy of rpsL. See Skorupski & Taylor, Gene 169, 47-52 (1996).Alternatively, vectors bearing only a positive selection marker can beused with one round of selection for cells expressing the marker, and asubsequent round of screening for cells that have lost the marker (e.g.,screening for drug sensitivity). The screen for cells that have lost thepositive selection marker is equivalent to screening against expressionof a negative selection marker. For example, Bacillus can be transformedwith a vector bearing a CAT gene and a sequence to be integrated. SeeHarwood & Cutting, Molecular Biological Methods for Bacillus, at pp.31-33. Selection for chloramphenicol resistance isolates cells that havetaken up vector. After a suitable period to allow recombination,selection for CAT sensitivity isolates cells which have lost the CATgene. About 50% of such cells will have undergone recombination with thesequence to be integrated.

Suicide vectors bearing a positive selection marker and optionally, anegative selection marker and a DNA fragment can integrate into hostchromosomal DNA by a single crossover at a site in chromosomal DNAhomologous to the fragment. Recombination generates an integrated vectorflanked by direct repeats of the homologous sequence. In some cells,subsequent recombination between the repeats results in excision of thevector and either acquisition of a desired mutation from the vector bythe genome or restoration of the genome to wild type.

In the present methods, after transfer of the gene library cloned in a suitable vector, positive selection is applied for expression of the positive selection marker. Because nonintegrated copies of the suicide vector are rapidly eliminated from cells, this selection enriches for cells that have integrated the vector into the host chromosome. The cells surviving positive selection can then be propagated and subjected to negative selection, or screened for loss of the positive selection marker. Negative selection selects against cells expressing the negative selection marker. Thus, cells that have retained the integrated vector express the negative marker and are selectively eliminated. The cells surviving both rounds of selection are those that initially integrated and then eliminated the vector. These cells are enriched for cells having genes modified by homologous recombination with the vector. This process diversifies by a single exchange of genetic information. However, if the process is repeated, either with the same vectors or with a library of fragments generated by PCR of pooled DNA from the enriched recombinant population, the diversity of targeted genes is enhanced exponentially with each round of recombination. This process can be repeated recursively, with selection being performed as desired.
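The two rounds of selection can be followed with a small bookkeeping sketch in Python. The `Cell` record, its fields and the example population are hypothetical, and the model assumes that each excision event either leaves the library-derived change in the chromosome or restores the original sequence.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    integrated: bool  # suicide vector currently integrated in the chromosome
    modified: bool    # chromosomal gene carries a change from the library


def positive_selection(cells):
    """Keep cells expressing the positive marker, i.e. those with an integrated vector."""
    return [c for c in cells if c.integrated]


def excise(cell, keeps_library_allele):
    """Recombination between the direct repeats removes the vector from the chromosome."""
    return Cell(integrated=False, modified=keeps_library_allele)


def negative_selection(cells):
    """Eliminate cells still expressing the negative marker from a retained vector."""
    return [c for c in cells if not c.integrated]


# Hypothetical population after transfer: three integrants, five non-integrants.
transformed = [Cell(True, False)] * 3 + [Cell(False, False)] * 5
integrants = positive_selection(transformed)

# During propagation, two integrants excise the vector and one retains it.
propagated = [
    excise(integrants[0], keeps_library_allele=True),
    excise(integrants[1], keeps_library_allele=False),
    integrants[2],
]
survivors = negative_selection(propagated)
print(len(integrants), len(survivors), sum(c.modified for c in survivors))
```

Only cells that first integrated and then lost the vector survive both rounds, and a fraction of those carry the modified gene.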

4.11.3.4 Exploiting Known Information such as Map Location or Function

In general, the above methods do not require knowledge of the number of genes to be optimized, their map location or their function. However, in some instances, where this information is available for one or more genes, it can be exploited. For example, if the property to be acquired by evolution is enhanced recombination of cells, one gene likely to be important is recA, even though many other genes, known and unknown, may make additional contributions. In this situation, the recA gene can be evolved, at least in part, separately from other candidate genes. The recA gene can be evolved by any of the methods of recursive recombination described in Section V. Briefly, this approach entails obtaining diverse forms of a recA gene, allowing the forms to recombine, selecting recombinants having improved properties, and subjecting the recombinants to further cycles of recombination and selection. At any point in the individualized improvement of recA, the diverse forms of recA can be pooled with fragments encoding other genes in a library to be used in the general methods described herein. In this way, the library is seeded to contain a higher proportion of variants in a gene known to be important to the property sought to be acquired than would otherwise be the case.

In one example, a plasmid is constructed carrying a non-functional(mutated) version of a chromosomal gene such as URA3, where thewild-type gene confers sensitivity to a drug (in this case 5-fluoroorotic acid). The plasmid also carries a selectable marker (resistanceto another drug such as kanamycin), and a library of recA variants.Transformation of the plasmid into the cell results in expression of therecA variants, some of which will catalyze homologous recombination atan increased rate. Those cells in which homologous recombinationoccurred are resistant to the selectable drug on the plasmid, and to5-fluoro orotic acid because of the disruption of the chromosomal copyof this gene. The recA variants which give the highest rates ofhomologous recombination are the most highly represented in a pool ofhomologous recombinants. The mutant recA genes can be isolated from thispool by PCR, re-stochastic &/or non-stochastic mutagenized, cloned backinto the plasmid and the process repeated. Other sequences can beinserted in place of recA to evolve other components of the homologousrecombination system.

4.11.3.5 Using Own Harvest of Cells so No Impurities

In some stochastic &/or non-stochastic mutagenesis methods, DNA substrates are isolated from natural sources and are not easily manipulated by DNA modifying or polymerizing enzymes due to recalcitrant impurities, which poison enzymatic reactions. Such difficulties can be avoided by processing DNA substrates through a harvesting strain. The harvesting strain is typically a cell type with natural competence and a capacity for homologous recombination between sequences with substantial diversity (e.g., sequences exhibiting only 75% sequence identity). The harvesting strain bears a vector encoding a negative selection marker flanked by two segments respectively complementary to two segments flanking a gene or other region of interest in the DNA from a target organism. The harvesting strain is contacted with fragments of DNA from the target organism. Fragments are taken up by natural competence, or other methods described herein, and a fragment of interest from the target organism recombines with the vector of the harvesting strain, causing loss of the negative selection marker. Selection against the negative marker allows isolation of cells that have taken up the fragment of interest.

Stochastic &/or non-stochastic mutagenesis can be carried out in the harvester strain (e.g., a RecE/T strain), or the vector can be isolated from the harvester strain for in vitro stochastic &/or non-stochastic mutagenesis or transfer to a different cell type for in vivo stochastic &/or non-stochastic mutagenesis. Alternatively, the vector can be transferred to a different cell type by conjugation, protoplast fusion or electrofusion. An example of a suitable harvester strain is Acinetobacter calcoaceticus mutS. Melnikov and Youngman, (1999) Nucl Acid Res 27(4): 1056-1062. This strain is naturally competent and takes up DNA in a nonsequence-specific manner. Also, because of the mutS mutation, this strain is capable of homologous recombination of sequences showing only 75% sequence identity.

4.12 Further Applications

4.12.1 Improved Recombinancy

One goal of whole cell evolution is to generate cells having improved capacity for recombination. Such cells are useful for a variety of purposes in molecular genetics including the in vivo formats of recursive sequence recombination described in Section V. Almost thirty genes (e.g., recA, recB, recC, recD, recE, recF, recG, recO, recQ, recR, recT, ruvA, ruvB, ruvC, sbcB, ssb, topA, gyrA and B, lig, polA, uvrD, E, recL, mutD, mutH, mutL, mutT, mutU, helD) and DNA sites (e.g., chi, recN, sbcC) involved in genetic recombination have been identified in E. coli, and cognate forms of several of these genes have been found in other organisms (e.g., rad51, rad55-rad57, Dmc1 in yeast (see Kowalczykowski et al., Microbiol. Rev. 58, 401-465 (1994); Kowalczykowski & Zarling, supra)), and human homologs of Rad51 and Dmc1 have been identified (see Sandler et al., Nucl. Acids Res. 24, 2125-2132 (1996)). At least some of the E. coli genes, including recA, are functional in mammalian cells, and can be targeted to the nucleus as a fusion with SV40 large T antigen nuclear targeting sequence (Reiss et al., Proc. Natl. Acad. Sci. USA, 93, 3094-3098 (1996)). Further, mutations in mismatch repair genes, such as mutL, mutS, mutH, mutT, relax homology requirements and allow recombination between more diverged sequences (Rayssiguier et al., Nature 342, 396-401 (1989)). The extent of recombination between divergent strains can be enhanced by impairing mismatch repair genes and stimulating SOS genes. Such can be achieved by use of appropriate mutant strains and/or growth under conditions of metabolic stress, which have been found to stimulate SOS and inhibit mismatch repair genes. Vulic et al., Proc. Natl. Acad. Sci. USA 94 (1997). In addition, this can be achieved by impairing the products of mismatch repair genes by exposure to selective inhibitors.

Starting substrates for recombination are selected according to thegeneral principles described above. That is, the substrates can be wholegenomes or fractions thereof containing recombination genes or sites.Large libraries of essentially random fragments can be seeded withcollections of fragments constituting variants of one or more knownrecombination genes, such as recA. Alternatively, libraries can beformed by mixing variant forms of the various known recombination genesand sites.

4.12.2 Expression of GFP Indicates Cell is Capable of Homologous Recombination

The library of fragments is introduced into the recipient cells to beimproved and recombination occurs, generating modified cells. Therecipient cells preferably contain a marker gene whose expression hasbeen disabled in a manner that can be corrected by recombination. Forexample, the cells can contain two copies of a marker gene bearingmutations at different sites, which copies can recombine to generate thewild type gene. A suitable marker gene is green fluorescent protein. Avector can be constructed encoding one copy of GFP having stop codonsnear the N-terminus, and another copy of GFP having stop codons near theC-terminus of the protein. The distance between the stop codons at therespective ends of the molecule is 500 bp and about 25% of recombinationevents result in active GFP. Expression of GFP in a cell signals that acell is capable of homologous recombination to recombine in between thestop codons to generate a contiguous coding sequence. By screening forcells expressing GFP, one enriches for cells having the highest capacityfor recombination. The same type of screen can be used followingsubsequent rounds of recombination. However, unless the selection markerused in previous round(s) was present on a suicide vector, subsequentround(s) should employ a second disabled screening marker within asecond vector bearing a different origin of replication or a differentpositive selection marker to vectors used in the previous rounds.
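The logic of the two-copy GFP reporter can be sketched in Python. The stop codon coordinates (50 and 550, i.e. 500 bp apart) and the function names are hypothetical. Each copy is represented only by the set of positions at which it carries a stop codon, and recombination is modeled as a single crossover that takes sequence to the left of the crossover point from one copy and the remainder from the other.

```python
def recombine(left_parent_stops, right_parent_stops, crossover):
    """Stop codons inherited by a product joined at `crossover`.

    Sequence left of the crossover comes from one copy and the rest from the other.
    """
    from_left = {s for s in left_parent_stops if s < crossover}
    from_right = {s for s in right_parent_stops if s >= crossover}
    return from_left | from_right


def is_functional(stops):
    """The reporter is active only if no stop codon interrupts the coding sequence."""
    return len(stops) == 0


n_terminal_mutant = {50}    # hypothetical copy with a stop near the N-terminus
c_terminal_mutant = {550}   # hypothetical copy with a stop 500 bp downstream

# A crossover between the two stops, with the left part taken from the
# C-terminal mutant, yields a contiguous coding sequence.
restored = recombine(c_terminal_mutant, n_terminal_mutant, crossover=300)
# The reciprocal product inherits both stops.
reciprocal = recombine(n_terminal_mutant, c_terminal_mutant, crossover=300)
# A crossover outside the interval between the stops does not restore the gene.
outside = recombine(c_terminal_mutant, n_terminal_mutant, crossover=20)
print(is_functional(restored), is_functional(reciprocal), is_functional(outside))
```

Only a crossover falling between the two stop codons, in the appropriate orientation, yields an active reporter in this model, which is why GFP expression serves as a signal of homologous recombination.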

4.12.3 Increased Genome Copy Number so more Chromosomes per Bacterial Cell to Make Evolution Quicker

The majority of bacterial cells in stationary phase cultures grown in rich media contain two, four or eight genomes. In minimal medium the cells contain one or two genomes. The number of genomes per bacterial cell thus depends on the growth rate of the cell as it enters stationary phase. This is because rapidly growing cells contain multiple replication forks, resulting in several genomes in the cells after termination. The number of genomes is strain dependent, although all strains tested have more than one chromosome in stationary phase. The number of genomes in stationary phase cells decreases with time. This appears to be due to fragmentation and degradation of entire chromosomes, similar to apoptosis in mammalian cells. This fragmentation of genomes in cells containing multiple genome copies results in massive recombination and mutagenesis. Useful mutants may find ways to use energy sources that will allow them to continue growing. Multigenome or gene-redundant cells are much more resistant to mutagenesis and can be improved for a selected trait faster.

Some cell types, such as Deinococcus radiodurans (Daly and Minton, J. Bacteriol. 177, 5495-5505 (1995)), exhibit polyploidy throughout the cell cycle. This cell type is highly radiation resistant due to the presence of many copies of the genome. High frequency recombination between the genomes allows rapid removal of mutations induced by a variety of DNA damaging agents.

A goal of the present methods is to evolve other cell types to have increased genome copy number akin to that of Deinococcus radiodurans. Preferably, the increased copy number is maintained through all or most of the cell cycle in all or most growth conditions. The presence of multiple genome copies in such cells results in a higher frequency of homologous recombination in these cells, both between copies of a gene in different genomes within the cell, and between a genome within the cell and a transfected fragment. The increased frequency of recombination allows the cells to be evolved more quickly to acquire other useful characteristics.

Starting substrates for recombination can be a diverse library of genes only a few of which are relevant to genomic copy number, a focused library formed from variants of gene(s) known or suspected to have a role in genomic copy number, or a combination of the two. As a general rule one would expect that increased copy number would be achieved by evolution of genes involved in replication and cell septation such that cell septation is inhibited without impairing replication. Genes involved in replication include tus, xerC, xerD, dif, gyrA, gyrB, parE, parC, TerA, TerB, TerC, TerD, TerE, TerF, and genes influencing chromosome partitioning and gene copy number include minD, mukA (tolC), mukB, mukC, mukD, spo0J, spoIIIE (Wake & Errington, Annu. Rev. Genet. 29, 41-67 (1995)). A useful source of substrates is the genome of a cell type such as Deinococcus radiodurans known to have the desired phenotype of multigenomic copy number. As well as, or instead of, the above substrates, fragments encoding protein or antisense RNA inhibitors to genes known to be involved in cell septation can also be used. In nature, the existence of multiple genomic copies in a cell type would usually not be advantageous due to the greater nutritional requirements needed to maintain this copy number. However, artificial conditions can be devised to select for high copy number.

Modified cells having recombinant genomes are grown in rich media (in which conditions, multicopy number should not be a disadvantage) and exposed to a mutagen, such as ultraviolet or gamma irradiation or a chemical mutagen, e.g., mitomycin, nitrous acid, photoactivated psoralens, alone or in combination, which induces DNA breaks amenable to repair by recombination. These conditions select for cells having multicopy number due to the greater efficiency with which mutations can be excised. Modified cells surviving exposure to mutagen are enriched for cells with multiple genome copies. If desired, selected cells can be individually analyzed for genome copy number (e.g., by quantitative hybridization with appropriate controls). Some or all of the collection of cells surviving selection provide the substrates for the next round of recombination. In addition, individual cells can be sorted using a cell sorter for those cells containing more DNA, e.g., using DNA specific fluorescent compounds or sorting for increased size using light dispersion. Eventually cells are evolved that have at least 2, 4, 6, 8 or 10 copies of the genome throughout the cell cycle. In a similar manner, protoplasts can also be recombined.

4.12.4 Evolve Secretion Pathways for Better Efficiency

4.12.5 Evolve to Manufacture Drugs or Chemicals

The protein (or metabolite) secretion pathways of bacterial and eukaryotic cells can be evolved to export desired molecules more efficiently, such as for the manufacturing of protein pharmaceuticals, small molecule drugs or specialty chemicals. Improvements in efficiency are particularly desirable for proteins requiring multisubunit assembly (such as antibodies) or extensive posttranslational modification before secretion.

The efficiency of secretion may depend on a number of genetic sequences including a signal peptide coding sequence, sequences encoding protein(s) that cleave or otherwise recognize the coding sequence, and the coding sequence of the protein being secreted. The latter may affect folding of the protein and the ease with which it can integrate into and traverse membranes. The bacterial secretion pathway in E. coli includes the SecA, SecB, SecE, SecD and SecF genes. In Bacillus subtilis, the major genes are secA, secD, secE, secF, secY, ffh, ftsY together with five signal peptidase genes (sipS, sipT, sipU, sipV and sipW) (Kunst et al., supra). For proteins requiring posttranslational modification, evolution of genes effecting such modification may contribute to improved secretion. Likewise genes with expression products having a role in assembly of multisubunit proteins (e.g., chaperonins) may also contribute to improved secretion.

Selection of substrates for recombination follows the general principles discussed above. In this case, the focused libraries referred to above comprise variants of the known secretion genes. For evolution of prokaryotic cells to express eukaryotic proteins, the initial substrates for recombination are often obtained at least in part from eukaryotic sources.

Incoming fragments can undergo recombination both with chromosomal DNA in recipient cells and with the screening marker construct present in such cells (see below). The latter form of recombination is important for evolution of the signal coding sequence incorporated in the screening marker construct. Improved secretion can be screened by the inclusion of a marker construct in the cells being evolved. The marker construct encodes a marker gene, operably linked to expression sequences, and usually operably linked to a signal peptide coding sequence. The marker gene is sometimes expressed as a fusion protein with a recombinant protein of interest. This approach is useful when one wants to evolve the recombinant protein coding sequence together with secretion genes.

4.12.6 Evolve so Product is Toxic to Cell Unless Secreted

In one variation, the marker gene encodes a product that is toxic to the cell containing the construct unless the product is secreted. Suitable toxin proteins include diphtheria toxin and ricin toxin. Propagation of modified cells bearing such a construct selects for cells that have evolved to improve secretion of the toxin. Alternatively, the marker gene can encode a ligand to a known receptor, and cells bearing the ligand can be detected by FACS using labeled receptor. Optionally, such a ligand can be operably linked to a phospholipid anchoring sequence that binds the ligand to the cell membrane surface following secretion. In a further variation, secreted marker protein can be maintained in proximity with the cell secreting it by distributing individual cells into agar drops. This is done, e.g., by droplet formation of a cell suspension. Secreted protein is confined within the agar matrix and can be detected by, e.g., FACS. In another variation, a protein of interest is expressed as a fusion protein together with beta-lactamase or alkaline phosphatase. These enzymes metabolize commercially available chromogenic substrates (e.g., X-gal), but do so only after secretion into the periplasm. Appearance of colored substrate in a colony of cells therefore indicates capacity to secrete the fusion protein and the intensity of color is related to the efficiency of secretion.

The cells identified by these screening and selection methods have the capacity to secrete increased amounts of protein. This capacity may be attributable to increased secretion and increased expression, or to increased secretion alone.

4.12.7 Evolve to Acquire Increased Expression of Recombinant Protein

Cells can also be evolved to acquire increased expression of a recombinant protein. The level of expression is, of course, highly dependent on the construct from which the recombinant protein is expressed and the regulatory sequences, such as the promoter, enhancer(s) and transcription termination site contained therein. Expression can also be affected by a large number of host genes having roles in transcription, posttranslational modification and translation. In addition, host genes involved in synthesis of ribonucleotide and amino acid monomers for transcription and translation may have indirect effects on efficiency of expression. Selection of substrates for recombination follows the general principles discussed above. In this case, focused libraries comprise variants of genes known to have roles in expression. For evolution of prokaryotic cells to express eukaryotic proteins, the initial substrates for recombination are often obtained, at least in part, from eukaryotic sources; that is, eukaryotic genes encoding proteins such as chaperonins involved in secretion and/or assembly of proteins. Incoming fragments can undergo recombination both with chromosomal DNA in recipient cells and with the screening marker construct present in such cells (see below).

Screening for improved expression can be effected by including a reporter construct in the cells being evolved. The reporter construct expresses (and usually secretes) a reporter protein, such as GFP, which is easily detected and nontoxic. The reporter protein can be expressed alone or together with a protein of interest as a fusion protein. If the reporter protein is secreted, the screening effectively selects for cells having either improved secretion or improved expression, or both.

4.12.8 Evolve Plant Cells to Acquire Resistance

A further application of recursive sequence recombination is the evolution of plant cells, and transgenic plants derived from the same, to acquire resistance to pathogenic diseases (fungi, viruses and bacteria), insects, chemicals (such as salt, selenium, pollutants, pesticides, herbicides, or the like), including, e.g., atrazine or glyphosate, or to modify chemical composition, yield or the like. The substrates for recombination can again be whole genomic libraries, fractions thereof or focused libraries containing variants of gene(s) known or suspected to confer resistance to one of the above agents. Frequently, library fragments are obtained from a different species to the plant being evolved.

The DNA fragments are introduced into plant tissues, cultured plant cells, plant microspores, or plant protoplasts by standard methods including electroporation (Fromm et al., Proc. Natl. Acad. Sci. USA 82, 5824 (1985)), infection by viral vectors such as cauliflower mosaic virus (CaMV) (Hohn et al., Molecular Biology of Plant Tumors, (Academic Press, New York, 1982) pp. 549-560; Howell, U.S. Pat. No. 4,407,956), high velocity ballistic penetration by small particles with the nucleic acid either within the matrix of small beads or particles, or on the surface (Klein et al., Nature 327, 70-73 (1987)), use of pollen as vector (WO 85/01856), or use of Agrobacterium tumefaciens or A. rhizogenes carrying a T-DNA plasmid in which DNA fragments are cloned. The T-DNA plasmid is transmitted to plant cells upon infection by Agrobacterium tumefaciens, and a portion is stably integrated into the plant genome (Horsch et al., Science 233, 496-498 (1984); Fraley et al., Proc. Natl. Acad. Sci. USA 80, 4803 (1983)).

Diversity can also be generated by genetic exchange between plant protoplasts according to the same principles described below for fungal protoplasts. Procedures for formation and fusion of plant protoplasts are described by Takahashi et al., U.S. Pat. No. 4,677,066; Akagi et al., U.S. Pat. No. 5,360,725; Shimamoto et al., U.S. Pat. No. 5,250,433; Cheney et al., U.S. Pat. No. 5,426,040.

4.12.9 Plant Genome Stochastic &/or Non-Stochastic Mutagenesis

Plant genome stochastic &/or non-stochastic mutagenesis allows recursive cycles to be used for the introduction and recombination of genes or pathways that confer improved properties to desired plant species. Any plant species, including weeds and wild cultivars, showing a desired trait, such as herbicide resistance, salt tolerance, pest resistance, or temperature tolerance, can be used as the source of DNA that is introduced into the crop or horticultural host plant species.

Genomic DNA prepared from the source plant is fragmented (e.g., by DNase I, restriction enzymes, or mechanically) and cloned into a vector suitable for making plant genomic libraries, such as pGA482 (An, G., 1995, Methods Mol. Biol. 44: 47-58). This vector contains the A. tumefaciens left and right borders needed for gene transfer to plant cells and antibiotic markers for selection in E. coli, Agrobacterium, and plant cells. A multicloning site is provided for insertion of the genomic fragments. A cos sequence is present for the efficient packaging of DNA into bacteriophage lambda heads for transfection of the primary library into E. coli. The vector accepts DNA fragments of 25-40 kb.

The primary library can also be directly electroporated into an A. tumefaciens or A. rhizogenes strain that is used to infect and transform host plant cells (Main, G D et al., 1995, Methods Mol. Biol. 44: 405-412). Alternatively, DNA can be introduced by electroporation or PEG-mediated uptake into protoplasts of the recipient plant species (Bilang et al. (1994) Plant Mol. Biol. Manual, Kluwer Academic Publishers, A1: 1-16) or by particle bombardment of cells or tissues (Christou, ibid, A2: 1-15). If necessary, antibiotic markers in the T-DNA region can be eliminated, as long as selection for the trait is possible, so that the final plant products contain no antibiotic genes.

Stably transformed whole cells acquiring the trait are selected on solid or liquid media containing the agent to which the introduced DNA confers resistance or tolerance. If the trait in question cannot be selected for directly, transformed cells can be selected with antibiotics and allowed to form callus or regenerated to whole plants and then screened for the desired property.

The second and further cycles consist of isolating genomic DNA from each transgenic line and introducing it into one or more of the other transgenic lines. In each round, transformed cells are selected or screened for incremental improvement. To speed the process of using multiple cycles of transformation, plant regeneration can be deferred until the last round. Callus tissue generated from the protoplasts or transformed tissues can serve as a source of genomic DNA and new host cells. After the final round, fertile plants are regenerated and the progeny are selected for homozygosity of the inserted DNAs. Ultimately, a new plant is created that carries multiple inserts which additively or synergistically combine to confer high levels of the desired trait. Alternatively, microspores can be isolated as homozygotes generated from spontaneous diploids.

In addition, the introduced DNA that confers the desired trait can be traced because it is flanked by known sequences in the vector. Either PCR or plasmid rescue is used to isolate the sequences and characterize them in more detail. Long PCR (Foord, O S and Rose, E A, 1995, PCR Primer: A Laboratory Manual, CSHL Press, pp 63-77) of the full 25-40 kb insert is achieved with the proper reagents and techniques using the T-DNA border sequences as primers. If the vector is modified to contain the E. coli origin of replication and an antibiotic marker between the T-DNA borders, a rare cutting restriction enzyme, such as NotI or SfiI, that cuts only at the ends of the inserted DNA is used to create fragments containing the source plant DNA that are then self-ligated and transformed into E. coli where they replicate as plasmids. The total DNA or subfragment of it that is responsible for the transferred trait can be subjected to in vitro evolution by DNA stochastic &/or non-stochastic mutagenesis. The stochastic &/or non-stochastic mutagenized library can be reiteratively recombined by any method herein and then introduced into host plant cells and screened for improvement of the trait. In this way, single and multigene traits can be transferred from one species to another and optimized for higher expression or activity leading to whole organism improvement. This entire process can also be reiteratively repeated. Alternatively, the cells can be transformed microspores with the regenerated haploid plants being screened directly for improved traits as noted below.

4.12.10 Plant Cell is Put into Contact with Agent to See which Cells Survive

After a suitable period of incubation to allow recombination to occur and for expression of recombinant genes, the plant cells are contacted with the agent to which resistance is to be acquired, and surviving plant cells are collected. Some or all of these plant cells can be subject to a further round of recombination and screening. Eventually, plant cells having the required degree of resistance are obtained.

These cells can then be cultured into transgenic plants. Plant regeneration from cultured protoplasts is described in Evans et al., “Protoplast Isolation and Culture,” Handbook of Plant Cell Cultures 1, 124-176 (MacMillan Publishing Co., New York, 1983); Davey, “Recent Developments in the Culture and Regeneration of Plant Protoplasts,” Protoplasts, (1983) pp. 12-29, (Birkhauser, Basel 1983); Dale, “Protoplast Culture and Plant Regeneration of Cereals and Other Recalcitrant Crops,” Protoplasts (1983) pp. 31-41, (Birkhauser, Basel 1983); Binding, “Regeneration of Plants,” Plant Protoplasts, pp. 21-73, (CRC Press, Boca Raton, 1985).

4.12.11 Start in Bacterial Cell Since Faster Evolution and Transform into Plant

In a variation of the above method, one or more preliminary rounds of recombination and screening can be performed in bacterial cells according to the same general strategy as described for plant cells. More rapid evolution can be achieved in bacterial cells due to their greater growth rate and the greater efficiency with which DNA can be introduced into such cells. After one or more rounds of recombination/screening, a DNA fragment library is recovered from bacteria and transformed into the plant cells. The library can either be a complete library or a focused library. A focused library can be produced by amplification from primers specific for plant sequences, particularly plant sequences known or suspected to have a role in conferring resistance.

4.12.12 Microspore Manipulation

Microspores are haploid (1n) male spores that develop into pollen grains. Anthers contain large numbers of microspores in early-uninucleate to first-mitosis stages. Microspores have been successfully induced to develop into plants for most species, such as, e.g., rice (Chen, C C 1977 In Vitro. 13: 484-489), tobacco (Atanassov, I. et al. 1998 Plant Mol. Biol. 38: 1169-1178), Tradescantia (Savage J R K and Papworth D G. 1998 Mutat Res. 422: 313-322), Arabidopsis (Park S K et al. 1998 Development. 125: 3789-3799), sugar beet (Majewska-Sawka A and Rodrigues-Garcia N E 1996 J Cell Sci. 109: 859-866), barley (Olsen F L 1991 Hereditas 115: 255-266) and oilseed rape (Boutillier K A et al. 1994 Plant Mol. Biol. 26: 1711-1723).

The plants derived from microspores are predominantly haploid or diploid (infrequently polyploid and aneuploid). The diploid plants are homozygous and fertile and can be generated in a relatively short time. Microspores obtained from F1 hybrid plants represent great diversity, thus being an excellent model for studying recombination. In addition, microspores can be transformed with T-DNA introduced by Agrobacterium or other available means and then regenerated into individual plants. Furthermore, protoplasts can be made from microspores and they can be fused in a manner similar to what occurs in fungi and bacteria.

Microspores, due to their complex ploidy and regenerating ability, provide a tool for plant whole genome stochastic &/or non-stochastic mutagenesis. For example, if pollens from 4 parents are collected and pooled, and then used to randomly pollinate the parents, the progenies should have 2⁴=16 possible combinations. Assuming this plant has 7 chromosomes, microspores collected from the 16 progenies will represent 2⁷×16=2048 possible chromosomal combinations. This number is even greater if meiotic processes occur. When diploid, homozygous embryos are generated from these microspores, in many cases, they are screened for desired phenotypes, such as herbicide or disease resistance. In addition, for plant oil composition these embryos can be dissected into two halves: one for analysis, the other for regeneration into a viable plant. Protoplasts generated from microspores (especially the haploid ones) are pooled and fused. Microspores obtained from plants generated by protoplast fusion are pooled and fused again, increasing the genetic diversity of the resulting microspores. Microspores can be subjected to mutagenesis in various ways, such as by chemical mutagenesis, radiation-induced mutagenesis and, e.g., T-DNA transformation, prior to fusion or regeneration. New mutations which are generated can be recombined through the recursive processes described above and herein.
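The counting in the four-parent example can be checked with a few lines of arithmetic. The sketch below rests on two assumptions that the text does not spell out: the 16 progeny types are read as 4 pollen donors times 4 seed parents (which equals 2⁴), and "7 chromosomes" is read as the haploid number, so that each progeny plant can pass either homolog of each of 7 chromosomes into a microspore. The variable names are illustrative only.

```python
# Worked arithmetic for the pooled-pollen example (interpretation is an assumption).

N_PARENTS = 4            # parents contributing pollen and receiving pollen
HAPLOID_CHROMOSOMES = 7  # assumed reading of "7 chromosomes"

# Each of 4 pollen donors can fertilize each of 4 seed parents: 4 x 4 = 16 = 2**4.
progeny_types = N_PARENTS * N_PARENTS

# Without crossing over, a microspore receives one of two homologs per chromosome.
assortments_per_progeny = 2 ** HAPLOID_CHROMOSOMES

# Total chromosomal combinations across all progeny types.
total_combinations = assortments_per_progeny * progeny_types

print(progeny_types, assortments_per_progeny, total_combinations)
```

Crossing over during meiosis is ignored here, which is why the text notes that the real number is larger still.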

4.12.13 Acquisition of Salt Tolerance

DNA from a salt tolerant plant is isolated and used to create a genomic library. Protoplasts made from the recipient species are transformed/transfected with the genomic library (e.g., by electroporation, Agrobacterium, etc.). Cells are selected on media with a normally inhibitory level of NaCl. Only the cells with newly acquired salt tolerance will grow into callus tissue. The best lines are chosen and genomic libraries are made from their pooled DNA. These libraries are transformed into protoplasts made from the first round transformed calli. Again, cells are selected on increased salt concentrations. After the desired level of salt tolerance is achieved, the callus tissue can be induced to regenerate whole plants. Progeny of these plants are typically analyzed for homozygosity of the inserts to ensure stability of the acquired trait. At the indicated steps, plant regeneration or isolation and stochastic &/or non-stochastic mutagenesis of the introduced genes can be added to the overall protocol.

4.13 Evolve Transgenic Animals

4.13.1 Optimize Transgene

One goal of transgenesis is to produce transgenic animals, such as mice, rabbits, sheep, pigs, goats, and cattle, secreting a recombinant protein in the milk. A transgene for this purpose typically comprises in operable linkage a promoter and an enhancer from a milk-protein gene (e.g., alpha, beta, or gamma casein, beta-lactoglobulin, acid whey protein or alpha-lactalbumin), a signal sequence, a recombinant protein coding sequence and a transcription termination site.

Optionally, a transgene can encode multiple chains of a multichain protein, such as an immunoglobulin, in which case the two chains are usually individually operably linked to sets of regulatory sequences. Transgenes can be optimized for expression and secretion by recursive sequence recombination. Suitable substrates for recombination include regulatory sequences such as promoters and enhancers from milk-protein genes from different species or individual animals. Cycles of recombination can be performed in vitro or in vivo by any of the formats discussed. Screening is performed in vitro on cultures of mammary-gland derived cells, such as HC11 or MacT, transfected with transgenes and reporter constructs such as those discussed above. After several cycles of recombination and screening, transgenes resulting in the highest levels of expression and secretion are extracted from the mammary gland tissue culture cells and used to transfect embryonic cells, such as zygotes and embryonic stem cells, which are matured into transgenic animals.

4.13.2 Optimize Whole Animal by Transforming into Embryonic Cells Gene of Desired Trait

4.13.2.1 Growth Hormone

In this approach, libraries of incoming fragments are transformed into embryonic cells, such as ES cells or zygotes. The fragments can be variants of a gene known to confer a desired property, such as growth hormone. Alternatively, the fragments can be partial or complete genomic libraries including many genes. Fragments are usually introduced into zygotes by microinjection as described by Gordon et al., Methods Enzymol. 101, 414 (1984); Hogan et al., Manipulation of the Mouse Embryo: A Laboratory Manual (C.S.H.L. N.Y., 1986) (mouse embryo); Hammer et al., Nature 315, 680 (1985) (rabbit and porcine embryos); Gandolfi et al., J. Reprod. Fert. 81, 23-28 (1987); Rexroad et al., J. Anim. Sci. 66, 947-953 (1988) (ovine embryos); and Eyestone et al., J. Reprod. Fert. 85, 715-720 (1989); Camous et al., J. Reprod. Fert. 72, 779-785 (1984); and Heyman et al., Theriogenology 27, 59-68 (1987) (bovine embryos). Zygotes are then matured and introduced into recipient female animals which gestate the embryo and give birth to a transgenic offspring.

Alternatively, transgenes can be introduced into embryonic stem cells(ES).

These cells are obtained from preimplantation embryos cultured in vitro. Bradley et al., Nature 309, 255-258 (1984). Transgenes can be introduced into such cells by electroporation or microinjection. Transformed ES cells are combined with blastocysts from a non-human animal. The ES cells colonize the embryo and in some embryos form the germ line of the resulting chimeric animal. See Jaenisch, Science, 240, 1468-1474 (1988).

Regardless of whether zygotes or ES cells are used, screening is performed on whole animals for a desired property, such as increased size and/or growth rate. DNA is extracted from animals having evolved toward acquisition of the desired property. This DNA is then used to transfect further embryonic cells. These cells can also be obtained from animals that have evolved toward the desired property in a split and pool approach. That is, DNA from one subset of such animals is transformed into embryonic cells prepared from another subset of the animals. Alternatively, the DNA from animals that have evolved toward acquisition of the desired property can be transfected into fresh embryonic cells. In either alternative, transfected cells are matured into transgenic animals, and the animals subjected to a further round of screening for the desired property.

Initially, a library is prepared of variants of a growth hormone gene. The variants can be natural or induced. The library is coated with recA protein and transferred into fertilized fish eggs. The fish eggs then mature into fish of different sizes. The growth hormone gene fragment of genomic DNA from large fish is then amplified by PCR and used in the next round of recombination. Alternatively, fish α-IFN is evolved to enhance resistance to viral infections, as described below.

4.13.2.2 Evolution of Improved Hormones for Expression in Transgenic Animals

4.13.3 To Create Animals with Improved Traits

Improved hormones can be evolved for expression in transgenic animals (e.g., fish) to create animals with improved traits. Hormones and cytokines are key regulators of size, body weight, viral resistance and many other commercially important traits. DNA stochastic &/or non-stochastic mutagenesis is used to rapidly evolve the genes for these proteins using in vitro assays. This was demonstrated with the evolution of the human alpha interferon genes to have potent antiviral activity on murine cells. Large improvements in activity were achieved in two cycles of family stochastic &/or non-stochastic mutagenesis of the human IFN genes.

In general, a method of increasing resistance to virus infection in cells can be performed by first introducing a stochastic &/or non-stochastic mutagenized library comprising at least one stochastic &/or non-stochastic mutagenized interferon gene into animal cells to create an initial library of animal cells or animals. The initial library is then challenged with the virus. Animal cells or animals which are resistant to the virus are selected from the initial library, and a plurality of transgenes is recovered from a plurality of the resistant animal cells or animals. The plurality of transgenes is recombined to produce an evolved library of animal cells or animals, which is again challenged with the virus. Cells or animals which are resistant to the virus are selected from the evolved library.

For example, genes evolved with in vitro assays are introduced into the germplasm of animals or plants to create improved strains. One limitation of this procedure is that in vitro assays are often only crude predictors of in vivo activity. However, with improving methods for the production of transgenic plants and animals, one can now marry whole organism breeding with molecular breeding. The approach is to introduce stochastic &/or non-stochastic mutagenized libraries of hormone genes into the species of interest. This can be done with a single gene per transgenic or with pools of genes per transgenic. Progeny are then screened for the phenotype of interest. In this case, stochastic &/or non-stochastic mutagenized libraries of interferon genes (alpha IFN, for example) are introduced into transgenic fish. The library of transgenic fish is challenged with a virus. The most resistant fish are identified (i.e., either survivors of a lethal challenge, or those that are deemed most ‘healthy’ after the challenge). The IFN transgenes are recovered by PCR and stochastic &/or non-stochastic mutagenized in either a poolwise or a pairwise fashion. This generates an evolved library of IFN genes. A second library of transgenic fish is created and the process is repeated. In this way, IFN is evolved for improved antiviral activity in a whole organism assay. This procedure is general and can be applied to any trait that is affected by a gene or gene family of interest and which can be quantitatively measured.
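The challenge, recover, recombine and repeat cycle has the shape of a simple selection loop. The following is a minimal sketch under stated assumptions, not the procedure itself: each "gene" is a list of integers, `challenge_score` is a hypothetical stand-in for the whole-organism viral challenge readout, recombination is a single pairwise crossover, and the selected variants are carried into the next library alongside the recombinants.

```python
# Toy recursive selection loop (all names, sizes and scoring are hypothetical).
import random

def challenge_score(gene):
    """Hypothetical stand-in for the measured resistance of a transgenic animal."""
    return sum(gene)

def shuffle_pair(a, b, rng):
    """Toy pairwise recombination: one crossover between equal-length variants."""
    x = rng.randrange(1, len(a))
    return a[:x] + b[x:]

def evolve(library, rounds, keep, rng):
    """Repeat: challenge the library, keep the best, recombine them, rebuild."""
    best_per_round = []
    for _ in range(rounds):
        # "Challenge": rank the library and keep the most resistant members.
        survivors = sorted(library, key=challenge_score, reverse=True)[:keep]
        best_per_round.append(challenge_score(survivors[0]))
        # "Recover and recombine": build new variants from pairs of survivors.
        children = [shuffle_pair(*rng.sample(survivors, 2), rng)
                    for _ in range(len(library) - keep)]
        # Assumption: survivors are retained in the next library.
        library = survivors + children
    return library, best_per_round

rng = random.Random(0)
start = [[rng.randint(0, 9) for _ in range(8)] for _ in range(40)]
final, best = evolve(start, rounds=5, keep=10, rng=rng)
print(best)
```

Because the selected variants are retained each round in this sketch, the best score cannot fall from one round to the next; a real whole-organism assay is noisy and offers no such guarantee.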

Fish interferon sequence data is available for the Japanese flatfish (Paralichthys olivaceus) as mRNA sequence (Tamai et al. (1993) “Cloning and expression of flatfish (Paralichthys olivaceus) interferon cDNA.” Biochim. Biophys. Acta 1174, 182-186; see also Tamai et al. (1993) “Purification and characterization of interferon-like antiviral protein derived from flatfish (Paralichthys olivaceus) lymphocytes immortalized by oncogenes.” Cytotechnology 11 (2): 121-131). This sequence can be used to clone out IFN genes from this species. This sequence can also be used as a probe to clone homologous interferons from additional species of fish. As well, additional sequence information can be utilized to clone out more species of fish interferons. Once a library of interferons has been cloned, these can be family stochastic &/or non-stochastic mutagenized to generate a library of variants.

In one embodiment, BHK-21 (a fibroblast cell line from hamster) can be transfected with the stochastic &/or non-stochastic mutagenized IFN-expression plasmids. Active recombinant IFN is produced and then purified by WGA agarose affinity chromatography (Tamai et al. 1993 Biochim. Biophys. Acta, supra). The antiviral activity of IFN can be measured on fish cells challenged by rhabdovirus (Tamai et al. (1993) “Purification and characterization of interferon-like antiviral protein derived from flatfish (Paralichthys olivaceus) lymphocytes immortalized by oncogenes.” Cytotechnology 11 (2): 121-131).

4.13.4 Whole Genome Stochastic &/or Non-Stochastic Mutagenesis in Higher Organisms: Poolwise Recursive Breeding

The present invention provides a procedure for generating large combinatorial libraries of higher eukaryotes, plants, fish, domesticated animals, etc. In addition to the procedures outlined above, poolwise combination of male and female gametes can also be used to generate large diverse molecular libraries.

In one aspect, the process includes recursive poolwise matings for several generations without any deliberate screening. This is similar to classical breeding, except that pools of organisms, rather than pairs of organisms, are mated, thereby accelerating the generation of genetic diversity. This method is similar to recursive fusion of a diverse population of bacterial protoplasts resulting in the generation of multiparent progeny harboring genetic information from all of the starting population of bacteria. The process described here is to perform analogous artificial or natural matings of large populations of natural isolates, applying a split pool mating strategy. Before mating, all of the male gametes (i.e., pollen, sperm, etc.) are isolated from the starting population and pooled. These are then used to “self” fertilize a mixed pool of the female gametes from the same population.

The process is repeated with the subsequent progeny for several generations, with the final progeny being a combinatorial organism library with each member having genetic information originating from many if not all of the starting “parents.” This process generates large diverse organism libraries on which many selections and/or screens can be imposed, and it does not require sophisticated in vitro manipulation of genes. Nevertheless, it results in the creation of useful new strains (perhaps well diluted in the population) in a much shorter time frame than such organisms could be generated using a classical targeted breeding approach.
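The way recursive pooling spreads founder ancestry through the library can be illustrated with a small simulation. The following Python sketch is illustrative only and is not part of the described procedure: it assumes a hermaphroditic population (as in the corn example), fertilization between two distinct individuals drawn at random from the pool, and it tracks only pedigree ancestry, not which alleles are actually inherited after meiosis. The function names, population sizes, and seed are hypothetical.

```python
import random


def poolwise_breeding(n_founders, generations, pop_size, seed=0):
    """Simulate recursive poolwise mating and track founder ancestry.

    Each individual is represented by the set of founder ids that appear
    in its pedigree. This is an upper bound on the founders that can have
    contributed genetic material; it does not model meiosis.
    """
    rng = random.Random(seed)
    population = [frozenset([i]) for i in range(n_founders)]
    for _ in range(generations):
        next_generation = []
        for _ in range(pop_size):
            # Draw two distinct parents from the pool (selfing is prevented).
            mother, father = rng.sample(range(len(population)), 2)
            next_generation.append(population[mother] | population[father])
        population = next_generation
    return population


def mean_founders(population):
    """Average number of distinct founders in each individual's pedigree."""
    return sum(len(ind) for ind in population) / len(population)


if __name__ == "__main__":
    for g in range(1, 7):
        library = poolwise_breeding(n_founders=64, generations=g, pop_size=200, seed=1)
        print(g, round(mean_founders(library), 1))
```

In this model an individual can have at most 2^g distinct founders in its pedigree after g generations, so a starting population of 64 founders needs at least six cycles before any single progeny could descend from all of them.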

These libraries are generated relatively quickly (e.g., typically in less than three years for most plants of commercial interest, with six cycles or fewer of recursive breeding being sufficient to generate desired diversity). An additional benefit of these methods is that the resulting libraries provide organismal diversity in areas, such as agriculture, aquaculture, and animal husbandry, that are currently genetically homogeneous.

Examples of these methods for several organisms are described below.

4.13.5 Plants

A population of plants, for example all of the different corn strains in a commercial seed/germplasm collection, is grown and the pollen from the entire population is harvested and pooled. This mixed pollen population is then used to “self” fertilize the same population. Self pollination is prevented, so that the fertilization is combinatorial. The cross results in all pairwise crosses possible within the population, and the resulting seeds represent many of the possible outcomes of each of these pairwise crosses. The seeds from the fertilized plants are then harvested, pooled, and planted, and the pollen is again harvested, pooled, and used to “self” fertilize the population. After only several generations, the resulting population is a very diverse combinatorial library of corn. The seeds from this library are harvested and screened for desirable traits, e.g., salt tolerance, growth rate, productivity, yield, disease resistance, etc. Essentially any plant collection can be modified by this approach.

Important commercial crops include both monocots and dicots. Monocots include plants in the grass family (Gramineae), such as plants in the subfamilies Festucoideae and Poacoideae, which together include several hundred genera including plants in the genera Agrostis, Phleum, Dactylis, Sorghum, Setaria, Zea (e.g., corn), Oryza (e.g., rice), Triticum (e.g., wheat), Secale (e.g., rye), Avena (e.g., oats), Hordeum (e.g., barley), Saccharum, Poa, Festuca, Stenotaphrum, Cynodon, Coix, the Olyreae, Phareae and many others. Plants in the family Gramineae are particularly preferred target plants for the methods of the invention.

Additional preferred targets include other commercially important crops, e.g., from the families Compositae (the largest family of vascular plants, including at least 1,000 genera, including important commercial crops such as sunflower), and Leguminosae or “pea family,” which includes several hundred genera, including many commercially valuable crops such as pea, beans, lentil, peanut, yam bean, cowpeas, velvet beans, soybean, clover, alfalfa, lupine, vetch, lotus, sweet clover, wisteria, and sweetpea. Common crops applicable to the methods of the invention include Zea mays, rice, soybean, sorghum, wheat, oats, barley, millet, sunflower, and canola.

This process can also be carried out using pollen from different species or more divergent strains (e.g., crossing the ancient grasses with corn). Different plant species can be forced to cross. Only a few plants would have to result from an initial cross in order to make the process viable. These few progeny, e.g., from a cross between soybean and corn, would generate pollen and eggs, each of which would represent a different meiotic outcome from the recombination of the two genomes. The pollen would be harvested and used to “self” pollinate the original progeny. This process would then be carried out recursively. This would generate a large family stochastic &/or non-stochastic mutagenized library of two or more species, which could be subsequently screened.

4.13.6 Fish

The natural tendency of fish to lay their eggs outside of the body and to have a male cover those eggs with sperm provides another opportunity for a split pooled breeding strategy. The eggs from many different fish, e.g., salmon from different fisheries about the world, can be harvested, pooled, and then fertilized with similarly collected and pooled salmon sperm. The fertilization will result in all of the possible pairwise matings of the starting population. The resulting progeny are then grown and again the sperm and eggs are harvested and pooled, with each egg and sperm representing a different meiotic outcome of the different crosses. The pooled sperm are then used to fertilize the pooled eggs and the process is carried out recursively. After several generations the resulting progeny can then be subjected to selections and screens for desired properties, such as size, disease resistance, etc.

4.13.7 Animals

The advent of in vitro fertilization and surrogate motherhood provides a means of whole genome stochastic &/or non-stochastic mutagenesis in animals such as mammals. As with fish, the eggs and the sperm from a population, for example from all slaughter cows, are collected and pooled. The pooled eggs are then in vitro fertilized with the pooled sperm. The resulting embryos are then returned to surrogate mothers for development. As above, this process is repeated recursively until a large diverse population is generated that can be screened for desirable traits.

A technically feasible approach would be similar to that used for plants. In this case, sperm from the males of the starting population is collected and pooled, and then this pooled sample is used to artificially inseminate multiple females from each of the starting populations. Only one (or a few) sperm would succeed in each animal, but these should be different for each fertilization. The process is reiterated by harvesting the sperm from all of the male progeny, pooling it, and using it to fertilize all of the female progeny. The process is carried out recursively for several generations to generate the organism library, which can then be screened.

4.14 Predictive Tool in Looking for Drugs

Recursive sequence recombination can be used to simulate natural evolution of pathogenic microorganisms in response to exposure to a drug under test. Using recursive sequence recombination, evolution proceeds at a faster rate than in natural evolution. One measure of the rate of evolution is the number of cycles of recombination and screening required until the microorganism acquires a defined level of resistance to the drug. The information from this analysis is of value in comparing the relative merits of different drugs and, in particular, in predicting their long term efficacy on repeated administration.
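The cycles-to-resistance measure described above lends itself to a simple tabulation. The Python sketch below is a hypothetical illustration, not a prescribed analysis: it assumes that, after each cycle of recombination and screening, the highest drug concentration tolerated by the evolved population has been recorded, and that the "defined level of resistance" is expressed as a threshold on that value. The function names, drug labels, and numbers are invented for the example.

```python
import math


def cycles_to_resistance(tolerated_by_cycle, threshold):
    """Return the first cycle (1-based) at which the tolerated drug
    concentration reaches the threshold, or None if it never does."""
    for cycle, tolerated in enumerate(tolerated_by_cycle, start=1):
        if tolerated >= threshold:
            return cycle
    return None


def rank_drugs(tolerated_by_drug, threshold):
    """Order drugs from slowest to fastest acquisition of resistance.

    Drugs for which resistance was never reached are ranked first.
    """
    def key(name):
        cycles = cycles_to_resistance(tolerated_by_drug[name], threshold)
        return math.inf if cycles is None else cycles

    return sorted(tolerated_by_drug, key=key, reverse=True)


# Made-up illustrative records: tolerated concentration after each cycle.
records = {
    "drug_A": [1, 4, 16],
    "drug_B": [1, 1, 2, 4, 8],
    "drug_C": [1, 1, 1],
}
print(rank_drugs(records, threshold=8))
```

Under these assumptions, a drug that requires more cycles before the threshold is reached would be predicted to retain efficacy longer on repeated administration.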

The pathogenic microorganisms used in this analysis include the bacteria that are a common source of human infections, such as chlamydia, rickettsial bacteria, mycobacteria, staphylococci, streptococci, pneumococci, meningococci and gonococci, klebsiella, proteus, serratia, pseudomonas, legionella, diphtheria, salmonella, bacilli, cholera, tetanus, botulism, anthrax, plague, leptospirosis, and Lyme disease bacteria.

Evolution is effected by transforming an isolate of bacteria that is sensitive to a drug under test with a library of DNA fragments. The fragments can be a mutated version of the genome of the bacteria being evolved. If the target of the drug is a known protein or nucleic acid, a focused library containing variants of the corresponding gene can be used. Alternatively, the library can come from other kinds of bacteria, especially bacteria typically found inhabiting human tissues, thereby simulating the source material available for recombination in vivo. The library can also come from bacteria known to be resistant to the drug. After transformation and propagation of bacteria for an appropriate period to allow for recombination to occur and recombinant genes to be expressed, the bacteria are screened by exposing them to the drug under test and then collecting survivors. Surviving bacteria are subject to further rounds of recombination. The subsequent round can be effected by a split and pool approach in which DNA from one subset of surviving bacteria is introduced into a second subset of bacteria. Alternatively, a fresh library of DNA fragments can be introduced into surviving bacteria. Subsequent round(s) of selection can be performed at increasing concentrations of drug, thereby increasing the stringency of selection.

4.14.1 Biosynthesis

Metabolic engineering can be used to alter organisms to optimize the production of practically any metabolic intermediate, including antibiotics, vitamins, amino acids such as phenylalanine and aromatic amino acids, ethanol, butanol, polymers such as xanthan gum and bacterial cellulose, peptides, and lipids. When such compounds are already produced by a host, the recursive sequence recombination techniques described above can be used to optimize production of the desired metabolic intermediate, including such features as increasing enzyme substrate specificity and turnover number, altering metabolic fluxes to reduce the concentrations of toxic substrates or intermediates, increasing resistance of the host to such toxic compounds, eliminating, reducing or altering the need for inducers of gene expression/activity, increasing the production of enzymes necessary for metabolism, etc.

Enzymes can also be evolved for improved activity in solvents other than water. This is useful because intermediates in chemical syntheses are often protected by blocking groups which dramatically affect the solubility of the compound in aqueous solvents. Many compounds can be produced by a combination of pure chemical and enzymically catalyzed reactions. Performing enzymic reactions on almost insoluble substrates is clearly very inefficient, so the availability of enzymes that are active in other solvents will be of great use. One example of such a scheme is the evolution of a para-nitrobenzyl esterase to remove protecting groups from an intermediate in loracarbef synthesis (Moore, J. C. and Arnold, F. H. Nature Biotechnology 14: 458-467 (1996)). In this case alternating rounds of error-prone PCR and colony screening for production of a fluorescent reporter from a substrate analogue were used to generate a mutant esterase that was 16-fold more active than the parent molecule in 30% dimethylformamide. No individual mutation was found to contribute more than a 2-fold increase in activity; rather, it was the combination of a number of mutations which led to the overall increase.
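The arithmetic behind the last observation can be made explicit. The sketch below assumes, purely for illustration, that the effects of individual mutations on activity combine multiplicatively and independently; the cited work is not claimed to have established this. Under that assumption, a 16-fold overall gain with no single mutation exceeding 2-fold implies at least four contributing mutations. The function name is hypothetical.

```python
def min_mutations_needed(total_fold, max_single_fold):
    """Smallest number of mutations whose combined effect can reach
    total_fold, assuming each contributes at most max_single_fold and
    that effects multiply (an illustrative assumption)."""
    if max_single_fold <= 1:
        raise ValueError("max_single_fold must be greater than 1")
    count = 0
    combined = 1.0
    while combined < total_fold:
        combined *= max_single_fold
        count += 1
    return count


print(min_mutations_needed(16, 2))  # prints 4
```

If the typical per-mutation gain were smaller than 2-fold, correspondingly more mutations would have to be combined, which is the situation in which pooling beneficial mutations from many mutants matters most.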

Structural analysis of the mutant protein showed that the amino acid changes were distributed throughout the length of the protein in a manner that could not have been rationally predicted. Sequential rounds of error-prone PCR have the problem that after each round all but one mutant is discarded, with a concomitant loss of information contained in all the other beneficial mutations. Recursive sequence recombination avoids this problem, and would thus be ideally suited to evolving enzymes for catalysis in other solvents, as well as in conditions where salt concentrations or pH are different from the original enzyme optima.

In addition, the yield of almost any metabolic pathway can be increased, whether the pathway consists entirely of genes endogenous to the host organism or wholly or partly of heterologous genes. Optimization of the expression levels of the enzymes in a pathway is more complex than simply maximizing expression. In some cases regulated, rather than constitutive, expression of an enzyme may be advantageous for cell growth and therefore for product yield, as seen for production of phenylalanine (Backman et al. Ann. NY Acad. Sci. 589: 16-24 (1990)) and 2-keto-L-gluconic acid (Anderson et al. U.S. Pat. No. 5,032,514). In addition, it is often advantageous for industrial purposes to express proteins in organisms other than their original hosts. New host strains may be preferable for a variety of reasons, including ease of cloning and transformation, pathogenicity, ability to survive in particular environments and a knowledge of the physiology and genetics of the organisms. However, proteins expressed in heterologous organisms often show markedly reduced activity for a variety of reasons, including inability to fold properly in the new host (Sarthy et al. Appl. Environ. Micro. 53: 1996-2000 (1987)). Such difficulties can be overcome by the recursive sequence recombination strategies of the instant invention.

4.14.2 Antibiotics

The range of natural small molecule antibiotics includes but is not limited to peptides, peptidolactones, thiopeptides, beta-lactams, glycopeptides, lantibiotics, microcins, polyketide-derived antibiotics (anthracyclines, tetracyclines, macrolides, avermectins, polyethers and ansamycins), chloramphenicol, aminoglycosides, aminocyclitols, polyoxins, agrocins and isoprenoids. There are at least three ways in which recursive sequence recombination techniques of the instant invention can be used to facilitate novel drug synthesis, or to improve biosynthesis of existing antibiotics.

First, antibiotic synthesis enzymes can be “evolved” together with transport systems that allow entry of compounds used as antibiotic precursors to improve uptake and incorporation of function-altering artificial side chain precursors. For example, penicillin V is produced by feeding Penicillium the artificial side chain precursor phenoxyacetic acid, and LY146032 by feeding Streptomyces roseosporus decanoic acid (Hopwood, Phil. Trans. R. Soc. Lond. B 324: 549-562 (1989)). Poor precursor uptake and poor incorporation by the synthesizing enzyme often lead to inefficient formation of the desired product. Recursive sequence recombination of these two systems can increase the yield of desired product.

Furthermore, a combinatorial approach can be taken in which an enzyme is stochastic &/or non-stochastic mutagenized for novel catalytic activity/substrate recognition (perhaps by including randomizing oligonucleotides in key positions such as the active site). A number of different substrates (for example, analogues of side chains that are normally incorporated into the antibiotic) can then be combined with all the different enzymes and the products tested for biological activity. In this embodiment, plates are made containing different potential antibiotic precursors (such as the side chain analogues). The microorganisms containing the stochastic &/or non-stochastic mutagenized library (the library strain) are replicated onto those plates, together with a competing, antibiotic sensitive, microorganism (the indicator strain). Library cells that are able to incorporate the new side chain to produce an effective antibiotic will thus be able to compete with the indicator strain, and will be selected for.
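The plate-based scheme above amounts to testing every precursor against every library member. The following Python sketch shows only that bookkeeping; the assay itself is represented by a placeholder callback, and all names (screen_plates, toy_assay, the analogue and variant labels) are hypothetical stand-ins rather than anything specified in this disclosure.

```python
from itertools import product


def screen_plates(precursors, library_variants, outcompetes_indicator):
    """Return every (precursor, variant) pair for which the library
    variant outgrows the indicator strain on that precursor plate.

    outcompetes_indicator(variant, precursor) stands in for the
    experimental readout and must return True or False.
    """
    hits = []
    for precursor, variant in product(precursors, library_variants):
        if outcompetes_indicator(variant, precursor):
            hits.append((precursor, variant))
    return hits


# Toy readout: which analogues each variant is assumed to incorporate.
accepted = {
    "variant_2": {"analogue_B"},
    "variant_3": {"analogue_A", "analogue_B"},
}


def toy_assay(variant, precursor):
    return precursor in accepted.get(variant, set())


hits = screen_plates(
    ["analogue_A", "analogue_B"],
    ["variant_1", "variant_2", "variant_3"],
    toy_assay,
)
print(hits)
```

Because each plate carries one precursor and the whole library is replicated onto it, the number of combinations tested grows as the product of the two list lengths while the number of plates grows only with the number of precursors.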

Second, the expression of heterologous genes transferred from one antibiotic synthesizing organism to another can be optimized. The newly introduced enzyme(s) act on secondary metabolites in the host cell, transforming them into new compounds with novel properties. Using traditional methods, introduction of foreign genes into antibiotic synthesizing hosts has already resulted in the production of novel hybrid antibiotics. Examples include mederrhodin, dihydrogranatirhodin, 6-deoxyerythromycin A, isovalerylspiramycin and other hybrid macrolides (Cameron et al. Appl. Biochem. Biotechnol. 38: 105-140 (1993)). The recursive sequence recombination techniques of the instant invention can be used to optimize expression of the foreign genes, to stabilize the enzyme in the new host cell, and to increase the activity of the introduced enzyme against its new substrates in the new host cell. In some embodiments of the invention, the host genome may also be so optimized.

Third, the substrate specificity of an enzyme involved in secondary metabolism can be altered so that it will act on and modify a new compound, or so that its activity is changed and it acts at a different subset of positions of its normal substrate. Recursive sequence recombination can be used to alter the substrate specificities of enzymes. Furthermore, in addition to recursive sequence recombination of individual enzymes being a strategy to generate novel antibiotics, recursive sequence recombination of entire pathways, by altering enzyme ratios, will alter metabolite fluxes and may result not only in increased antibiotic synthesis, but also in the synthesis of different antibiotics. This can be deduced from the observation that expression of different genes from the same cluster in a foreign host leads to different products being formed (see p. 80 in Hutchinson et al. (1991) Ann NY Acad Sci, 646: 78-93).

Recursive sequence recombination of the introduced gene clusters may result in a variety of expression levels of different proteins within the cluster (because it produces different combinations of mutations, in this case regulatory mutations). This in turn may lead to a variety of different end products. Thus, “evolution” of an existing antibiotic synthesizing pathway could be used to generate novel antibiotics by modifying either the rates or the substrate specificities of enzymes in that pathway.

Additionally, antibiotics can also be produced in vitro by the action of a purified enzyme on a precursor. For example, isopenicillin N synthase catalyses the cyclization of many analogues of its normal substrate (δ-(L-α-aminoadipyl)-L-cysteinyl-D-valine) (Hutchinson, Med. Res. Rev. 8: 557-567 (1988)). Many of these products are active as antibiotics. A wide variety of substrate analogues can be tested for incorporation by secondary metabolite synthesizing enzymes without concern for the initial efficiency of the reaction. Recursive sequence recombination can be used subsequently to increase the rate of reaction with a promising new substrate.

Thus, organisms already producing a desired antibiotic can be evolved with the recursive sequence recombination techniques described above to maximize production of that antibiotic. Additionally, new antibiotics can be evolved by manipulation of genetic material from the host by the recursive sequence recombination techniques described above. Genes for antibiotic production can be transferred to a preferred host after cycles of recursive sequence recombination or can be evolved in the preferred host as described above.

Antibiotic genes are generally clustered and are often positively regulated, making them especially attractive candidates for the recursive sequence recombination techniques of the instant invention. Additionally, some genes of related pathways show cross-hybridization, making them preferred candidates for the generation of new pathways for new antibiotics by the recursive sequence recombination techniques of the invention.

Furthermore, increases in secondary metabolite production, including enhancement of substrate fluxes (by increasing the rate of a rate limiting enzyme), deregulation of the pathway (by suppression of negative control elements or over expression of activators), and the relief of feedback controls (by mutation of the regulated enzyme to a feedback-insensitive deregulated protein), can be achieved by recursive sequence recombination without exhaustive analysis of the regulatory mechanisms governing expression of the relevant gene clusters.

The host chosen for expression of evolved genes is preferably resistant to the antibiotic produced, although in some instances production methods can be designed so as to sacrifice host cells when the amount of antibiotic produced is commercially significant yet lethal to the host. Similarly, bioreactors can be designed so that the growth medium is continually replenished, thereby “drawing off” antibiotic produced and sparing the lives of the producing cells. Preferably, the mechanism of resistance is not the degradation of the antibiotic produced.

Numerous screening methods for increased antibiotic expression are known in the art, as discussed above, including screening for organisms that are more resistant to the antibiotic that they produce. This may result from linkage between expression of the antibiotic synthesis and antibiotic resistance genes (Chater, Bio/Technology 8: 115-121 (1990)). Another screening method is to fuse a reporter gene (e.g., xylE from the Pseudomonas TOL plasmid) to the antibiotic production genes. Antibiotic synthesis gene expression can then be measured by looking for expression of the reporter (e.g., xylE encodes a catechol dioxygenase which produces yellow muconic semialdehyde when colonies are sprayed with catechol (Zukowski et al. Proc. Natl. Acad. Sci. U.S.A. 80: 1101-1105 (1983))).

The wide variety of cloned antibiotic genes provides a wealth of starting materials for the recursive sequence recombination techniques of the instant invention. For example, genes have been cloned from Streptomyces cattleya which direct cephamycin C synthesis in the non-antibiotic producer Streptomyces lividans (Chen et al. Bio/Technology 6: 1222-1224 (1988)). Clustered genes for penicillin biosynthesis (δ-(L-α-aminoadipyl)-L-cysteinyl-D-valine synthetase; isopenicillin N synthetase and acyl coenzyme A:6-aminopenicillanic acid acyltransferase) have been cloned from Penicillium chrysogenum. Transfer of these genes into Neurospora crassa and Aspergillus niger results in the synthesis of active penicillin V (Smith et al. Bio/Technology 8: 39-41 (1990)). For a review of cloned genes involved in Cephalosporin C, Penicillins G and V and Cephamycin C biosynthesis, see Piepersberg, Crit. Rev. Biotechnol. 14: 251-285 (1994). For a review of cloned clusters of antibiotic-producing genes, see Chater, Bio/Technology 8: 115-121 (1990). Other examples of antibiotic synthesis genes transferred to industrial producing strains, or over expression of genes, include tylosin, cephamycin C, cephalosporin C, LL-E33288 complex (an antitumor and antibacterial agent), doxorubicin, spiramycin and other macrolide antibiotics, reviewed in Cameron et al. Appl. Biochem. Biotechnol. 38: 105-140 (1993).

4.14.3 Biosynthesis to Replace Chemical Synthesis of Antibiotics

Some antibiotics are currently made by chemical modifications of biologically produced starting compounds. Complete biosynthesis of the desired molecules may currently be impractical because of the lack of an enzyme with the required enzymatic activity and substrate specificity. For example, 7-aminodeacetoxycephalosporanic acid (7-ADCA) is a precursor for semi-synthetically produced cephalosporins. 7-ADCA is made by a chemical ring expansion from penicillin V followed by enzymatic deacylation of the phenoxyacetyl group. Cephalosporins could in principle be produced biologically from their corresponding penicillins (e.g., cephalosporin V or G from penicillin V or G) using penicillin N expandase, but other penicillins (such as penicillin V or G) are not used as substrates by known expandases. The recursive sequence recombination techniques of the invention can be used to alter the enzyme so that it will use penicillin V as a substrate. Similarly, penicillin transacylase could be so modified to accept cephalosporins or cephamycins as substrates.

In yet another example, penicillin amidase expressed in E. coli is a key enzyme in the production of penicillin G derivatives. The enzyme is generated from a precursor peptide and tends to accumulate as insoluble aggregates in the periplasm unless non-metabolizable sugars are present in the medium (Scherrer et al., Appl. Microbiol. Biotechnol. 42: 85-91 (1994)). Evolution of this enzyme through the methods of the instant invention could be used to generate an enzyme that folds better, leading to a higher level of active enzyme expression.

In yet another example, penicillin G acylase covalently linked to agarose is used in the synthesis of penicillin G derivatives. The enzyme can be stabilized for increased activity, longevity and/or thermal stability by chemical modification (Fernandez-Lafuente et al. Enzyme Microb. Technol. 14: 489-495 (1992)). Increased thermal stability is an especially attractive application of the recursive sequence recombination techniques of the instant invention, which can obviate the need for the chemical modification of such enzymes. Selection for thermostability can be performed in vivo in E. coli or in thermophiles at higher temperatures. In general, thermostability is a good first step in enhancing general stabilization of enzymes. Random mutagenesis and selection can also be used to adapt enzymes to function in non-aqueous solvents (Arnold, Curr Opin Biotechnol, 4: 450-455 (1993); Chen et al. Proc. Natl. Acad. Sci. U.S.A., 90: 5618-5622 (1993)). Recursive sequence recombination represents a more powerful (since recombinogenic) method of generating mutant enzymes that are stable and active in non-aqueous environments. Additional screening can be done on the basis of enzyme stability in solvents.

4.14.4 Polyketides

Polyketides include antibiotics such as tetracycline and erythromycin, anti-cancer agents such as daunomycin, immunosuppressants such as FK506 and rapamycin and veterinary products such as monensin and avermectin. Polyketide synthases (PKS's) are multifunctional enzymes that control the chain length, choice of chain-building units and reductive cycle that generates the huge variation in naturally occurring polyketides. Polyketides are built up by sequential transfers of “extender units” (fatty acyl CoA groups) onto the appropriate starter unit (examples are acetate, coumarate, propionate and malonamide). The PKS's determine the number of condensation reactions and the type of extender groups added and may also fold and cyclize the polyketide precursor. PKS's reduce specific β-keto groups and may dehydrate the resultant β-hydroxyls to form double bonds. Modifications of the nature or number of building blocks used, the positions at which β-keto groups are reduced, the extent of reduction and the different positions of possible cyclizations result in formation of different final products. Polyketide research is currently focused on modification and inhibitor studies, site directed mutagenesis and 3-D structure elucidation to lay the groundwork for rational changes in enzymes that will lead to new polyketide products.

Recently, McDaniel et al. (Science 262: 1546-1550 (1995)) have developed a Streptomyces host-vector system for efficient construction and expression of recombinant PKSs. Hutchinson (Bio/Technology 12: 375-380 (1994)) reviewed targeted mutation of specific biosynthetic genes and suggested that microbial isolates can be screened by DNA hybridization for genes associated with known pharmacologically active agents so as to provide new metabolites or increased yields of metabolites already being produced. In particular, that review focuses on polyketide synthase and pathways to aminoglycoside and oligopeptide antibiotics.

The recursive sequence recombination techniques of the instant invention can be used to generate modified enzymes that produce novel polyketides without such detailed analytical effort. The availability of the PKS genes on plasmids and the existence of E. coli-Streptomyces shuttle vectors (Wehmeier, Gene 165: 149-150 (1995)) make the process of recursive sequence recombination by the techniques described above especially attractive. Techniques for selection of antibiotic producing organisms can be used as described above; additionally, in some embodiments screening for a particular desired polyketide activity or compound is preferable.

4.14.5 Isoprenoids

Isoprenoids result from cyclization of farnesyl pyrophosphate by sesquiterpene synthases. The diversity of isoprenoids is generated not by the backbone, but by control of cyclization. Cloned examples of isoprenoid synthesis genes include trichodiene synthase from Fusarium sporotrichioides, pentalene synthase from Streptomyces, aristolochene synthase from Penicillium roquefortii, and epi-aristolochene synthase from N. tabacum (Cane, D. E. (1995). Isoprenoid antibiotics, pages 633-655, in “Genetics and Biochemistry of Antibiotic Production” edited by Vining, L. C. & Stuttard, C., published by Butterworth-Heinemann). Recursive sequence recombination of sesquiterpene synthases will be of use both in allowing expression of these enzymes in heterologous hosts (such as plants and industrial microbial strains) and in alteration of enzymes to change the cyclized product made. A large number of isoprenoids are active as antiviral, antibacterial, antifungal, herbicidal, insecticidal or cytostatic agents. Antibacterial and antifungal isoprenoids could thus preferably be screened for using the indicator cell type system described above, with the producing cell competing with bacteria or fungi for nutrients. Antiviral isoprenoids could preferably be screened for by their ability to confer resistance to viral attack on the producing cell.

4.14.6 Bioactive Peptide Derivatives

Examples of bioactive non-ribosomally synthesized peptides include the antibiotics cyclosporin, pepstatin, actinomycin, gramicidin, depsipeptides, vancomycin, etc. These peptide derivatives are synthesized by complex enzymes rather than ribosomes. Again, increasing the yield of such non-ribosomally synthesized peptide antibiotics has thus far been done by genetic identification of biosynthetic “bottlenecks” and over expression of specific enzymes (see, for example, p. 133-135 in “Genetics and Biochemistry of Antibiotic Production” edited by Vining, L. C. & Stuttard, C., published by Butterworth-Heinemann). Recursive sequence recombination of the enzyme clusters can be used to improve the yields of existing bioactive non-ribosomally made peptides in both natural and heterologous hosts.

Like polyketide synthases, peptide synthases are modular and multifunctional enzymes catalyzing condensation reactions between activated building blocks (in this case amino acids) followed by modifications of those building blocks (see Kleinkauf, H. and von Dohren, H. Eur. J. Biochem. 236: 335-351 (1996)). Thus, as for polyketide synthases, recursive sequence recombination can also be used to alter peptide synthases: modifying the specificity of the amino acid recognized by each binding site on the enzyme and altering the activity or substrate specificities of sites that modify these amino acids to produce novel compounds with antibiotic activity. Other peptide antibiotics are made ribosomally and then post-translationally modified. Examples of this type of antibiotic are lantibiotics (produced by gram positive bacteria such as Staphylococcus, Streptomyces, Bacillus, and Actinoplanes) and microcins (produced by Enterobacteriaceae). Modifications of the original peptide include (in lantibiotics) dehydration of serine and threonine, condensation of dehydroamino acids with cysteine, or simple N- and C-terminal blocking (microcins). For ribosomally made antibiotics both the peptide-encoding sequence and the modifying enzymes may have their expression levels modified by recursive sequence recombination. Again, this will lead both to increased levels of antibiotic synthesis and, by modulation of the levels of the modifying enzymes (and the sequence of the ribosomally synthesized peptide itself), to novel antibiotics.

Screening can be done as for other antibiotics as described above, including competition with a sensitive (or even initially insensitive) microbial species. Use of competing bacteria that have resistances to the antibiotic being produced will select strongly either for greatly elevated levels of that antibiotic (so that it swamps out the resistance mechanism) or for novel derivatives of that antibiotic that are not neutralized by the resistance mechanism.

4.14.7 Polymers

Several examples of metabolic engineering to produce biopolymers have been reported, including the production of the biodegradable plastic polyhydroxybutyrate (PHB), and the polysaccharide xanthan gum. For a review, see Cameron et al. Applied Biochem. Biotech. 38: 105-140 (1993). Genes for these pathways have been cloned, making them excellent candidates for the recursive sequence recombination techniques described above. Expression of such evolved genes in a commercially viable host such as E. coli is an especially attractive application of this technology.

Examples of starting materials for recursive sequence recombination include but are not limited to genes from bacteria such as Alcaligenes, Zoogloea, Rhizobium, Bacillus, and Azotobacter, which produce polyhydroxyalkanoates (PHAs) such as polyhydroxybutyrate (PHB) intracellularly as energy reserve materials in response to stress. Genes from Alcaligenes eutrophus that encode enzymes catalyzing the conversion of acetoacetyl CoA to PHB have been transferred both to E. coli and to the plant Arabidopsis thaliana (Poirier et al. Science 256: 520-523 (1992)). Two of these genes (phbB and phbC, encoding acetoacetyl-CoA reductase and PHB synthase respectively) allow production of PHB in Arabidopsis. The plants producing the plastic are stunted, probably because of adverse interactions between the new metabolic pathway and the plants' original metabolism (i.e., depletion of substrate from the mevalonate pathway). Improved production of PHB in plants has been attempted by localization of the pathway enzymes to organelles such as plastids. Other strategies such as regulation of tissue specificity, expression timing and cellular localization have been suggested to solve the deleterious effects of PHB expression in plants. The recursive sequence recombination techniques of the invention can be used to modify such heterologous genes as well as specific cloned interacting pathways (e.g., mevalonate), and to optimize PHB synthesis in industrial microbial strains, for example to remove the requirement for stresses (such as nitrogen limitation) in growth conditions.

Additionally, other microbial polyesters are made by different bacteria in which additional monomers are incorporated into the polymer (Peoples et al. in Novel Biodegradable Microbial Polymers, E A Dawes, ed., pp 191-202 (1990)). Recursive sequence recombination of these genes or pathways singly or in combination into a heterologous host will allow the production of a variety of polymers with differing properties, including variation of the monomer subunit ratios in the polymer.

Another polymer whose synthesis may be manipulated by recursive sequence recombination is cellulose. The genes for cellulose biosynthesis have been cloned from Agrobacterium tumefaciens (Matthysse, A. G. et al. J. Bacteriol. 177: 1069-1075 (1995)). Recursive sequence recombination of this biosynthetic pathway could be used either to increase synthesis of cellulose, or to produce mutants in which alternative sugars are incorporated into the polymer.

4.14.8 Carotenoids

Carotenoids are a family of over 600 terpenoids produced in the general isoprenoid biosynthetic pathway by bacteria, fungi and plants (for a review, see Armstrong, J. Bact. 176: 4795-4802 (1994)). These pigments protect organisms against photooxidative damage as well as functioning as anti-tumor agents, free radical-scavenging anti-oxidants, and enhancers of the immune response. Additionally, they are used commercially in pigmentation of cultured fish and shellfish. Examples of carotenoids include but are not limited to myxobacton, spheroidene, spheroidenone, lutein, astaxanthin, violaxanthin, 4-ketorulene, myxoxanthophyll, echinenone, lycopene, zeaxanthin and its mono- and di-glucosides, alpha-, beta-, gamma- and sigma-carotene, beta-cryptoxanthin monoglucoside and neoxanthin.

Carotenoid synthesis is catalyzed by relatively small numbers of clustered genes: 11 different genes within 12 kb of DNA from Myxococcus xanthus (Botella et al. Eur. J. Biochem. 233: 238-248 (1995)) and 8 genes within 9 kb of DNA from Rhodobacter sphaeroides (Lang et al. J. Bact. 177: 2064-2073 (1995)). In some microorganisms, such as Thermus thermophilus, these genes are plasmid-borne (Tabata et al. FEBS Letts 341: 251-255 (1994)). These features make carotenoid synthetic pathways especially attractive candidates for recursive sequence recombination.

Transfer of some carotenoid genes into heterologous organisms results in expression. For example, genes from Erwinia uredovora and Haematococcus pluvialis will function together in E. coli (Kajiwara et al. Plant Mol. Biol. 29: 343-352 (1995)). E. herbicola genes will function in R. sphaeroides (Hunter et al. J. Bact. 176: 3692-3697 (1994)). However, some other genes do not; for example, R. capsulatus genes do not direct carotenoid synthesis in E. coli (Marrs, J. Bact. 146: 1003-1012 (1981)).

In an embodiment of the invention, the recursive sequence recombination techniques of the invention can be used to generate variants in the regulatory and/or structural elements of genes in the carotenoid synthesis pathway, allowing increased expression in heterologous hosts. Indeed, traditional techniques have been used to increase carotenoid production by increasing expression of a rate limiting enzyme in Thermus thermophilus (Hoshino et al. Appl. Environ. Micro. 59: 3150-3153 (1993)). Furthermore, mutation of regulatory genes can cause constitutive expression of carotenoid synthesis in actinomycetes, where carotenoid photoinducibility is otherwise unstable and lost at a relatively high frequency in some species (Kato et al. Mol. Gen. Genet. 247: 387-390 (1995)). These are both mutations that can be obtained by recursive sequence recombination.

The recursive sequence recombination techniques of the invention as described above can be used to evolve one or more carotenoid synthesis genes in a desired host without the need for analysis of regulatory mechanisms. Since carotenoids are colored, a colorimetric assay in microtiter plates, or even on growth media plates, can be used for screening for increased production.

In addition to increasing expression of carotenoids, carotenogenic biosynthetic pathways have the potential to produce a wide diversity of carotenoids, as the enzymes involved appear to be specific for the type of reaction they will catalyze, but not for the substrate that they modify. For example, two enzymes from the marine bacterium Agrobacterium aurantiacum (CrtW and CrtZ) synthesize six different ketocarotenoids from beta-carotene (Misawa et al. J. Bact. 177: 6576-6584 (1995)). This relaxed substrate specificity means that a diversity of substrates can be transformed into an even greater diversity of products. Introduction of foreign carotenoid genes into a cell can lead to novel and functional carotenoid-protein complexes, for example in photosynthetic complexes (Hunter et al. J. Bact. 176: 3692-3697 (1994)). Thus, the deliberate recombination of enzymes through the recursive sequence recombination techniques of the invention is likely to generate novel compounds. Screening for such compounds can be accomplished, for example, by the cell competition/survival techniques discussed above and by a colorimetric assay for pigmented compounds.

Another method of identifying new compounds is to use standard analytical techniques such as mass spectroscopy, nuclear magnetic resonance, high performance liquid chromatography, etc. Recombinant microorganisms can be pooled and extracts or media supernatants assayed from these pools. Any positive pool can then be subdivided and the procedure repeated until the single positive is identified (“sib-selection”).
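The pooling and subdivision procedure just described can be sketched in code. The following is a minimal illustration only, assuming a hypothetical assay callback (`pool_is_positive`) that reports whether a pooled extract contains the compound of interest; the clone names and the halving scheme are assumptions made for the example and are not prescribed by the method.

```python
from typing import Callable, List, Sequence


def sib_select(clones: Sequence[str],
               pool_is_positive: Callable[[Sequence[str]], bool]) -> List[str]:
    """Find positive clones by assaying pools and subdividing positive pools."""
    if not pool_is_positive(clones):
        return []              # a negative pool rules out all of its members
    if len(clones) == 1:
        return list(clones)    # a positive pool of one is the identified clone
    mid = len(clones) // 2     # assumption: split each positive pool in half
    return (sib_select(clones[:mid], pool_is_positive)
            + sib_select(clones[mid:], pool_is_positive))


# Hypothetical library of 16 clones containing one producer.
clones = [f"clone_{i:02d}" for i in range(16)]
producers = {"clone_11"}
assay_log = []                 # records every pool that was assayed


def assay(pool: Sequence[str]) -> bool:
    assay_log.append(tuple(pool))
    return any(c in producers for c in pool)


hits = sib_select(clones, assay)
print(hits, "found with", len(assay_log), "pool assays")
```

In this toy run the producer is located with fewer assays than testing each of the 16 clones individually, which is the practical appeal of pooling when positives are rare.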

4.14.9 Indigo Biosynthesis

Many dyes, i.e. agents for imparting color, are specialty chemicals with significant markets. As an example, indigo is currently produced chemically. However, nine genes have been combined in E. coli to allow the synthesis of indigo from glucose via the tryptophan/indole pathway (Murdock et al. Bio/Technology 11: 381-386 (1993)). A number of manipulations were performed to optimize indigo synthesis: cloning of nine genes, modification of the fermentation medium and directed changes in two operons to increase reaction rates and catalytic activities of several enzymes.

Nevertheless, bacterially produced indigo is not currently an economic proposition. The recursive sequence recombination techniques of the instant invention could be used to optimize indigo synthesizing enzyme expression levels and catalytic activities, leading to increased indigo production, thereby making the process commercially viable and reducing the environmental impact of indigo manufacture. Screening for increased indigo production can be done by colorimetric assays of cultures in microtiter plates.

4.14.10 Amino Acids

Amino acids of particular commercial importance include but are not limited to phenylalanine, monosodium glutamate, glycine, lysine, threonine, tryptophan and methionine. Backman et al. (Ann. NY Acad. Sci. 589: 16-24 (1990)) disclosed the enhanced production of phenylalanine in E. coli via a systematic and downstream strategy covering organism selection, optimization of biosynthetic capacity, and development of fermentation and recovery processes. As described in Simpson et al. (Biochem Soc Trans, 23: 381-387 (1995)), current work in the field of amino acid production is focused on understanding the regulation of these pathways in great molecular detail.

The recursive sequence recombination techniques of the instant invention would obviate the need for this analysis to obtain bacterial strains with higher secreted amino acid yields. Amino acid production could be optimized for expression using recursive sequence recombination of the amino acid synthesis and secretion genes as well as enzymes at the regulatory phosphoenolpyruvate branchpoint, from such organisms as Serratia marcescens, Bacillus, and the Corynebacterium-Brevibacterium group. In some embodiments of the invention, screening for enhanced production is preferably done in microtiter wells, using chemical tests well known in the art that are specific for the desired amino acid. Screening/selection for amino acid synthesis can also be done by using auxotrophic reporter cells that are themselves unable to synthesize the amino acid in question. If these reporter cells also produce a compound that stimulates the growth of the amino acid producer (this could be a growth factor, or even a different amino acid), then library cells that produce more amino acid will in turn receive more growth stimulant and will therefore grow more rapidly.

4.14.11 Vitamin C Synthesis

L-Ascorbic acid (vitamin C) is a commercially important vitamin with a world production of over 35,000 tons in 1984. Most vitamin C is currently manufactured chemically by the Reichstein process, although recently bacteria have been engineered that are able to transform glucose to 2,5-diketo-gluconic acid, and that product to 2-keto-L-idonic acid, the precursor to L-ascorbic acid (Boudrant, Enzyme Microb. Technol. 12: 322-329 (1990)).

The efficiencies of these enzymatic steps in bacteria are currently low. Using the recursive sequence recombination techniques of the instant invention, the genes can be genetically engineered to create one or more operons followed by expression optimization of such a hybrid L-ascorbic acid synthetic pathway to result in commercially viable microbial vitamin C biosynthesis. In some embodiments, screening for enhanced L-ascorbic acid production is preferably done in microtiter plates, using assays well known in the art.

4.15 Test for Resistance to Drugs

4.15.1 Find Drugs that Induce Resistance Slowly

A similar strategy can be used to simulate viral acquisition of drug resistance. The object is to identify drugs for which resistance can be acquired only slowly, if at all. The viruses to be evolved are those that cause infections in humans for which at least modestly effective drugs are available. Substrates for recombination can come from induced mutants, natural variants of the same viral strain or different viruses. If the target of the drug is known (e.g., nucleotide analogs which inhibit the reverse transcriptase gene of HIV), focused libraries containing variants of the target gene can be produced. Recombination of a viral genome with a library of fragments is usually performed in vitro. However, in situations in which the library of fragments constitutes variants of viral genomes or fragments that can be encompassed in such genomes, recombination can also be performed in vivo, e.g., by transfecting cells with multiple substrate copies (see Section V). For screening, recombinant viral genomes are introduced into host cells susceptible to infection by the virus and the cells are exposed to a drug effective against the virus (initially at low concentration). The cells can be spun to remove any virus that has not infected a cell. After a period of infection, progeny viruses can be collected from the culture medium, the progeny viruses being enriched for viruses that have acquired at least partial resistance to the drug. Alternatively, virally infected cells can be plated in a soft agar lawn and resistant viruses isolated from plaques. Plaque size provides some indication of the degree of viral resistance. Progeny viruses surviving screening are subject to additional rounds of recombination and screening at increased stringency until a predetermined level of drug resistance has been acquired. The predetermined level of drug resistance may reflect the maximum dosage of a drug practical to administer to a patient without intolerable side effects. The analysis is particularly valuable for investigating acquisition of resistance to various combinations of drugs, such as the growing list of approved anti-HIV drugs (e.g., AZT, ddI, ddC, d4T, TIBO 82150, nevirapine, 3TC, crixivan and ritonavir).
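The cycle of recombination followed by screening at increasing stringency can be sketched as a loop that reports how many rounds are needed to reach a predetermined resistance level. The model below is a deliberately crude toy and every element of it is an assumption made for illustration: a variant is a set of named mutations, recombination pools the mutations of two parents, and each mutation adds one arbitrary unit of tolerated dose. It models no real virus or drug.

```python
from itertools import combinations
from typing import FrozenSet, Optional, Set

Variant = FrozenSet[str]


def recombine(pool: Set[Variant]) -> Set[Variant]:
    """Toy recombination: each pair of variants may yield progeny with both mutation sets."""
    progeny = set(pool)
    for a, b in combinations(pool, 2):
        progeny.add(a | b)
    return progeny


def tolerated_dose(variant: Variant) -> int:
    """Toy assumption: every mutation adds one unit of tolerated drug dose."""
    return len(variant)


def rounds_to_resistance(pool: Set[Variant], target_dose: int,
                         max_rounds: int = 10) -> Optional[int]:
    """Count rounds of recombination and screening until target_dose is tolerated."""
    dose = 1                                   # screening starts at low concentration
    for round_no in range(1, max_rounds + 1):
        survivors = {v for v in recombine(pool) if tolerated_dose(v) >= dose}
        if not survivors:
            return None                        # population extinguished by the drug
        if max(tolerated_dose(v) for v in survivors) >= target_dose:
            return round_no
        pool, dose = survivors, dose + 1       # next round at increased stringency
    return None


start = {frozenset({m}) for m in ("m1", "m2", "m3", "m4")}
print(rounds_to_resistance(start, target_dose=4))
print(rounds_to_resistance(start, target_dose=5))
```

Under these assumptions, a drug for which the target level is reached only after many rounds, or not at all, would correspond to one for which resistance is acquired slowly.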

4.15.2 Method to Evolve Yeast Strains

Fragments are cloned into a YAC vector, and the resulting YAC library is transformed into competent yeast cells. Transformants containing a YAC are identified by selecting for a positive selection marker present on the YAC. The cells are allowed to recover and are then pooled. Thereafter, the cells are induced to sporulate by transferring the cells from rich medium to nitrogen and carbon limiting medium. In the course of sporulation, cells undergo meiosis. Spores are then induced to mate by return to rich media. Optionally, asci are lysed to liberate spores, so that the spores can mate with other spores originating from other asci. Mating results in recombination between YACs bearing different inserts, and between YACs and natural yeast chromosomes. The latter can be promoted by irradiating spores with ultraviolet light. Recombination can give rise to new phenotypes either as a result of genes expressed by fragments on the YACs or as a result of recombination with host genes, or both.

After induction of recombination between YACs and natural yeast chromosomes, YACs are often eliminated by selecting against a negative selection marker on the YACs. For example, YACs containing the marker URA3 can be selected against by propagation on media containing 5-fluoroorotic acid. Any exogenous or altered genetic material that remains is contained within natural yeast chromosomes. Optionally, further rounds of recombination between natural yeast chromosomes can be performed after elimination of YACs. Optionally, the same or different library of YACs can be transformed into the cells, and the above steps repeated. By recursively repeating this process, the diversity of the population is increased prior to screening.

After elimination of YACs, yeast are then screened or selected for a desired property. The property can be a new property conferred by transferred fragments, such as production of an antibiotic. The property can also be an improved property of the yeast such as improved capacity to express or secrete an exogenous protein, improved recombinogenicity, improved stability to temperature or solvents, or other property required of commercial or research strains of yeast.

Yeast strains surviving selection/screening are then subject to a further round of recombination. Recombination can be exclusively between the chromosomes of yeast surviving selection/screening. Alternatively, a library of fragments can be introduced into the yeast cells and recombined with endogenous yeast chromosomes as before. This library of fragments can be the same or different from the library used in the previous round of transformation. For example, the YACs could contain a library of genomic DNA isolated from a pool of the improved strains obtained in the earlier steps. YACs are eliminated as before, followed by additional rounds of recombination and/or transformation with further YAC libraries. Recombination is followed by another round of selection/screening, as above.

Further rounds of recombination/screening can be performed as needed until a yeast strain has evolved to acquire the desired property.

An exemplary scheme for evolving yeast by introduction of a YAC library begins with yeast containing an endogenous diploid genome and a YAC library of fragments representing variants of a sequence. The library is transformed into the cells to yield 100-1000 colonies per μg DNA. Most transformed yeast cells now harbor a single YAC as well as endogenous chromosomes. Meiosis is induced by growth on nitrogen and carbon limiting medium. In the course of meiosis the YACs recombine with other chromosomes in the same cell. Haploid spores resulting from meiosis mate and regenerate diploid forms. The diploid forms now harbor recombinant chromosomes, parts of which come from endogenous chromosomes and parts from YACs.

Optionally, the YACs can now be cured from the cells by selecting against a negative selection marker present on the YACs. Irrespective of whether YACs are selected against, cells are then screened or selected for a desired property. Cells surviving selection/screening are transformed with another YAC library to start another stochastic &/or non-stochastic mutagenesis cycle.

4.15.3 Evolve YACs for Transfer into Recipient Strain

These methods are based in part on the fact that multiple YACs can be harbored in the same yeast cell, and YAC-YAC recombination is known to occur (Green & Olson, Science 250, 94-98 (1990)). Inter-YAC recombination provides a format in which families of homologous genes harbored on fragments of >20 kb can be stochastically &/or non-stochastically mutagenized in vivo.

The starting population of DNA fragments shows sequence similarity with each other but differs as a result of, for example, induced, allelic or species diversity. Often the DNA fragments are known or suspected to encode multiple genes that function in a common pathway.

The fragments are cloned into a YAC and transformed into yeast, typically with positive selection for transformants. The transformants are induced to sporulate, as a result of which chromosomes undergo meiosis. The cells are then mated. Most of the resulting diploid cells now carry two YACs each having a different insert. These are again induced to sporulate and mated. The resulting cells harbor YACs of recombined sequence. The cells can then be screened or selected for a desired property. Typically, such selection occurs in the yeast strain used for stochastic &/or non-stochastic mutagenesis. However, if fragments being stochastically &/or non-stochastically mutagenized are not expressed in yeast, YACs can be isolated and transferred to an appropriate cell type in which they are expressed for screening. Examples of such properties include the synthesis or degradation of a desired compound, increased secretion of a desired gene product, or other detectable phenotype.

Preferably, the YAC library is transformed into haploid a and haploid α cells. These cells are then induced to mate with each other, i.e., they are pooled and induced to mate by growth on rich medium. The diploid cells, each carrying two YACs, are then transferred to sporulation medium. During sporulation, the cells undergo meiosis, and homologous chromosomes recombine. In this case, the genes harbored in the YACs will recombine, diversifying their sequences. The resulting haploid ascospores are then liberated from the asci by enzymatic degradation of the ascus wall or other available means, and the pooled liberated haploid ascospores are induced to mate by transfer to rich medium. This process is repeated for several cycles to increase the diversity of the DNA cloned into the YACs. The resulting population of yeast cells, preferably in the haploid state, is either screened for improved properties, or the diversified DNA is delivered to another host cell or organism for screening.

Cells surviving selection/screening are subjected to successive cycles of pooling, sporulation, mating and selection/screening until the desired phenotype has been observed. Recombination can be achieved simply by transferring cells from rich medium to carbon and nitrogen limited medium to induce sporulation, and then returning the spores to rich media to induce mating. Asci can be lysed to stimulate mating of spores originating from different asci.

After YACs have been evolved to encode a desired property they can be transferred to other cell types. Transfer can be by protoplast fusion, or retransformation with isolated DNA. For example, transfer of YACs from yeast to mammalian cells is discussed by Monaco & Larin, Trends in Biotechnology 12, 280-286 (1994); Montoliu et al., Reprod. Fertil. Dev. 6, 577-84 (1994); Lamb et al., Curr. Opin. Genet. Dev. 5, 342-8 (1995).

An exemplary scheme for stochastic &/or non-stochastic mutagenesis of a YAC fragment library in yeast is shown herein. A library of YAC fragments representing genetic variants is transformed into yeast that have diploid endogenous chromosomes. The transformed yeast continue to have diploid endogenous chromosomes, plus a single YAC. The yeast are induced to undergo meiosis and sporulate. The spores contain haploid genomes and are selected for those which contain a YAC, using the YAC selective marker. The spores are induced to mate, generating diploid cells. The diploid cells now contain two YACs bearing different inserts as well as diploid endogenous chromosomes. The cells are again induced to undergo meiosis and sporulate. During meiosis, recombination occurs between the YAC inserts, and recombinant YACs are segregated to ascoytes. Some ascoytes thus contain haploid endogenous chromosomes plus a YAC chromosome with a recombinant insert. The ascoytes mature to spores, which can mate again, generating diploid cells. Some diploid cells now possess a diploid complement of endogenous chromosomes plus two recombinant YACs. These cells can then be taken through further cycles of meiosis, sporulation and mating. In each cycle, further recombination occurs between YAC inserts and further recombinant forms of inserts are generated. After one or several cycles of recombination have occurred, cells can be tested for acquisition of a desired property. Further cycles of recombination, followed by selection, can then be performed in similar fashion.

4.15.4 In Vivo Stochastic &/or Non-Stochastic Mutagenesis of Genes by the Recursive Mating of Yeast Cells Harboring Homologous Genes in Identical Loci

A goal of DNA stochastic &/or non-stochastic mutagenesis is to mimic and expand the combinatorial capabilities of sexual recombination. In vitro DNA stochastic &/or non-stochastic mutagenesis succeeds in this process. However, by changing the mechanism of recombination and altering the conditions under which recombination occurs naturally, in vitro recombination methods may jeopardize intrinsic information in a DNA sequence that renders it “evolvable.” Stochastic &/or non-stochastic mutagenesis in vivo, by employing the natural crossing over mechanisms that occur during meiosis, may access inherent natural sequence information and provide a means of creating higher quality stochastic &/or non-stochastic mutagenized libraries. Described here is a method for the in vivo stochastic &/or non-stochastic mutagenesis of DNA that utilizes the natural mechanisms of meiotic recombination and provides an alternative method for DNA stochastic &/or non-stochastic mutagenesis.

The basic strategy is to clone genes to be stochastically &/or non-stochastically mutagenized into identical loci within the haploid genome of yeast. The haploid cells are then recursively induced to mate and to sporulate. The process subjects the cloned genes to recursive recombination during recursive cycles of meiosis. The resulting stochastic &/or non-stochastic mutagenized genes are then screened in situ or isolated and screened under different conditions.

For example, if one wished to reassemble a family of five lipase genes, the following provides a means of doing so in vivo.

The open reading frame of each lipase is amplified by PCR such that each ORF is flanked by identical 3′ and 5′ sequences. The 5′ flanking sequence is identical to a region within the 5′ coding sequence of the S. cerevisiae ura3 gene and the 3′ flanking sequence is identical to a region within the 3′ end of the ura3 gene. The flanking sequences are chosen such that homologous recombination of the PCR product with the ura3 gene results in the incorporation of the lipase gene and the disruption of the ura3 ORF. Both S. cerevisiae a and α haploid cells are then transformed with each of the PCR amplified lipase ORFs, and cells having incorporated a lipase gene into the ura3 locus are selected by growth on 5-fluoroorotic acid (5FOA is lethal to cells expressing functional URA3). The result is 10 cell types: two different mating types, each harboring one of the five lipase genes in the disrupted ura3 locus. These cells are then pooled and grown under conditions where mating between the a and α cells is favored, e.g. in rich medium. Mating results in a combinatorial mixture of diploid cells having all 25 possible combinations of lipase genes in the two ura3 loci (five choices from the a parent times five from the α parent). The cells are then induced to sporulate by growth under carbon and nitrogen limited conditions. During sporulation the diploid cells undergo meiosis to form four (two a and two α) haploid ascospores housed in an ascus.
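The cell-type bookkeeping in this example can be checked by direct enumeration. The sketch below is illustrative only; the labels `lipase1` to `lipase5` are placeholder names, and it assumes each diploid arises from exactly one a cell and one α cell, each carrying a single lipase gene at the disrupted locus. It counts both ordered pairings (which parent contributed which gene) and unordered gene pairs.

```python
from itertools import combinations_with_replacement, product

mating_types = ["a", "alpha"]
lipases = [f"lipase{i}" for i in range(1, 6)]   # placeholder gene names

# Haploid transformants: one mating type carrying one lipase gene.
haploid_types = list(product(mating_types, lipases))

# Diploids from a x alpha matings, tracking which parent gave which gene.
diploids_ordered = list(product(lipases, repeat=2))

# The same diploids if parental origin of each gene is ignored.
diploids_unordered = list(combinations_with_replacement(lipases, 2))

print(len(haploid_types), len(diploids_ordered), len(diploids_unordered))
```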

During meiosis I of the sporulation process, homologous chromosomes align and cross over. The lipase genes cloned into the ura3 loci will also align and recombine. Thus the resulting haploid ascospores will represent a library of cells each harboring a different possible chimeric lipase gene, each a unique result of the meiotic recombination of the two lipase genes in the original diploid cell. The walls of asci are degraded by treatment with zymolyase to liberate and allow the mixing of the individual ascospores. This mixture is then grown under conditions that promote the mating of the a and α haploid cells. It is important to liberate the individual ascospores, since mating will otherwise occur between the ascospores within an ascus.

Mixing of the haploid cells allows recombination between more than two lipase genes, enabling “poolwise recombination.” Mating brings together new combinations of chimeric genes that can then undergo recombination upon sporulation. The cells are recursively cycled through sporulation, ascospore mixing, and mating until sufficient diversity has been generated by the recursive pairwise recombination of the five lipase genes. The individual chimeric lipase genes either can be screened directly in the haploid yeast cells or transferred to an appropriate expression host.
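How recursive pairwise recombination comes to mix more than two parental genes can be illustrated with a toy model. The sketch below rests on simplifying assumptions that are not part of the method as described: genes are equal-length strings, each meiosis contributes at most one crossover between the two genes in a diploid, and every pairing and every crossover position is enumerated rather than sampled. The parent strings are hypothetical labels, not sequences.

```python
from itertools import permutations
from typing import Set


def mate_and_sporulate(pool: Set[str]) -> Set[str]:
    """One cycle: pair every two genes and collect all single-crossover chimeras."""
    progeny = set(pool)                    # unrecombined genes also persist
    for x, y in permutations(pool, 2):
        for k in range(1, len(x)):         # crossover between positions k-1 and k
            progeny.add(x[:k] + y[k:])
    return progeny


def parents_represented(gene: str) -> int:
    """Number of distinct parental genes contributing segments to a chimera."""
    return len(set(gene))


parents = {"AAAA", "BBBB", "CCCC"}         # each letter marks the parent of origin
cycle1 = mate_and_sporulate(parents)
cycle2 = mate_and_sporulate(cycle1)

print(max(parents_represented(g) for g in cycle1))
print(max(parents_represented(g) for g in cycle2))
```

In this model a single cycle yields only two-parent chimeras, while a second cycle, made possible by mixing the liberated spores, yields genes such as `ABCC` that carry segments from three parents.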

The process is described above for lipases and yeast; however, any sexual organisms into which genes can be directed can be employed, and any genes, of course, could be substituted for lipases. This process is analogous to the method of stochastic &/or non-stochastic mutagenesis of whole genomes by recursive pairwise mating. The diversity, however, in the whole genome case is distributed throughout the host genome rather than localized to specific loci.

4.15.5 Using YACs to Clone Unlinked but Functionally Important Genes from One Species into Another

Stochastic &/or non-stochastic mutagenesis of YACs is particularly amenable to transfer of unlinked but functionally related genes from one species to another, particularly where such genes have not been identified. Such is the case for several commercially important natural products, such as taxol. Transfer of the genes in the metabolic pathway to a different organism is often desirable because organisms naturally producing such compounds are not well suited for mass culturing.

Clusters of such genes can be isolated by cloning a total genomic library of DNA from an organism producing a useful compound into a YAC library. The YAC library is then transformed into yeast. The yeast is sporulated and mated such that recombination occurs between YACs and/or between YACs and natural yeast chromosomes.

Selection/screening is then performed for expression of the desired collection of genes. If the genes encode a biosynthetic pathway, expression can be detected from the appearance of product of the pathway. Production of individual enzymes in the pathway, or intermediates of the final expression product, or capacity of cells to metabolize such intermediates indicates partial acquisition of the synthetic pathway. The original library or a different library can be introduced into cells surviving selection/screening, and further rounds of recombination and selection/screening can be performed until the end product of the desired metabolic pathway is produced.

4.15.6 YAC-YAC Stochastic &/or Non-Stochastic Mutagenesis

If a phenotype of interest can be isolated to a single stretch of genomic DNA less than 2 megabases in length, it can be cloned into a YAC and replicated in S. cerevisiae. The cloning of similar stretches of DNA from related hosts into an identical YAC results in a population of yeast cells each harboring a YAC having a homologous insert effecting a desired phenotype. The recursive breeding of these yeast cells allows the homologous regions of these YACs to recombine during meiosis, allowing genes, pathways, and clusters to recombine during each cycle of meiosis. After several cycles of mating and segregation, the YAC inserts are well mutagenized by stochastic &/or non-stochastic means. The now very diverse yeast library could then be screened for phenotypic improvements resulting from the stochastic &/or non-stochastic mutagenesis of the YAC inserts.

4.15.7 YAC-Chromosome Stochastic &/or Non-Stochastic Mutagenesis

“Mitotic” recombination occurs during cell division and results from the recombination of genes during replication. This type of recombination is not limited to that between sister chromatids and can be enhanced by agents that induce recombination machinery, such as nicking chemicals and ultraviolet irradiation. Since it is often difficult to directly mate across a species barrier, it is possible to induce the recombination of homologous genes originating from different species by providing the target genes to a desired host organism as a YAC library. The genes harbored in this library are then induced to recombine with homologous genes on the host chromosome by enhanced mitotic recombination. This process is carried out recursively to generate a library of diverse organisms, which is then screened for those having the desired phenotypic improvements. The improved subpopulation is then mated recursively as above to identify new strains having accumulated multiple useful genetic alterations.

4.15.8 Accumulation of Multiple YACs Harboring Useful Genes

The accumulation of multiple unlinked genes that are required for the acquisition or improvement of a given phenotype can be accomplished by the stochastic &/or non-stochastic mutagenesis of YAC libraries. Genomic DNA from organisms having desired phenotypes, such as ethanol tolerance, thermotolerance, and the ability to ferment pentose sugars, is pooled, fragmented and cloned into several different YAC vectors, each having a different selective marker (his, ura, ade, etc). S. cerevisiae cells are transformed with these libraries, selected for the presence of the YACs (using selective media, i.e., uracil dropout media for the YAC containing the ura3 selective marker) and then screened for having acquired or improved a desired phenotype.

Surviving cells are pooled, mated recursively, and selected for the accumulation of multiple YACs (by propagation in medium with multiple nutritional dropouts). Cells that acquire multiple YACs harboring useful genomic inserts are identified by further screening. Optimized strains can be used directly; however, due to the burden a YAC may pose to a cell, the relevant YAC inserts can be minimized, subcloned, and recombined into the host chromosome to generate a more stable production strain.
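The selection logic described above can be illustrated with a minimal sketch. It assumes a simplified model in which each YAC marker (his, ura, ade) complements exactly one omitted nutrient; the dictionary of nutrients, the cell records, and the function names are hypothetical and serve only to show why growth on a medium with multiple dropouts implies the presence of multiple YACs.

```python
# Hypothetical mapping from YAC selective marker to the nutrient it makes dispensable.
NUTRIENT_FOR_MARKER = {"his": "histidine", "ura": "uracil", "ade": "adenine"}


def survives(cell_markers, dropouts):
    """Return True if the cell carries a marker for every nutrient omitted from the medium."""
    supplied = {NUTRIENT_FOR_MARKER[m] for m in cell_markers}
    return set(dropouts) <= supplied


def select(population, dropouts):
    """Keep only the cells able to grow on medium lacking the listed nutrients."""
    return [cell for cell in population if survives(cell["markers"], dropouts)]


if __name__ == "__main__":
    population = [
        {"id": 1, "markers": {"ura"}},
        {"id": 2, "markers": {"ura", "his"}},
        {"id": 3, "markers": set()},
    ]
    # Single dropout: any cell with the ura-marked YAC grows.
    print([c["id"] for c in select(population, ["uracil"])])
    # Double dropout: only cells that accumulated both YACs grow.
    print([c["id"] for c in select(population, ["uracil", "histidine"])])
```

In this model, adding a nutrient to the dropout list can only shrink the surviving set, which is the property the multiple-dropout propagation step relies on.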

4.15.9 Choice of Host SSF Organism

One example use for the present invention is to create an improved yeast for the production of ethanol from lignocellulosic biomass. Specifically, a yeast strain with improved ethanol tolerance and thermostability/thermotolerance is desirable. Parent yeast strains known for good behavior in a Simultaneous Saccharification and Fermentation (SSF) process are identified. These strains are combined with others known to possess ethanol tolerance and/or thermostability. S. cerevisiae is highly amenable to development for optimized SSF processes. It inherently possesses several traits for this use, including the ability to import and ferment a variety of sugars such as sucrose, glucose, galactose, maltose and maltotriose. Also, yeast has the capability to flocculate, enabling recovery of the yeast biomass at the end of a fermentation cycle, and allowing its re-use in subsequent bioprocesses. This is an important property in that it optimizes the use of nutrients in the growth medium. S. cerevisiae is also highly amenable to laboratory manipulation, has highly characterized genetics and possesses a sexual reproductive cycle. S. cerevisiae may be grown under either aerobic or anaerobic conditions, in contrast to some other potential SSF organisms that are strict anaerobes (e.g. Clostridium spp.), which makes them very difficult to handle in the laboratory. S. cerevisiae is also “generally recognized as safe” (“GRAS”), and, due to its widespread use for the production of important comestibles for the general public (e.g. beer, wine, bread, etc), is generally familiar and well known. S. cerevisiae is commonly used in fermentative processes, and the familiarity in its handling by fermentation experts eases the introduction of novel improved yeast strains into the industrial setting. S. cerevisiae strains that previously have been identified as particularly good SSF organisms, for example, S. cerevisiae D₅A (ATCC 200062) (South C R and Lynd L R. (1994) Appl. Biochem. Biotechnol. 45/46: 467-481; Ranatunga T D et al. (1997) Biotechnol. Lett. 19: 1125-1127), can be used for starting materials. In addition, other industrially used S. cerevisiae strains are optionally used as host strains, particularly those showing desirable fermentative characteristics, such as S. cerevisiae Y567 (ATCC 24858) (Sitton O C et al. (1979) Process Biochem. 14(9): 7-10; Sitton O C et al. (1981) Adv. Biotechnol. 2: 231-237; McMurrough I et al. (1971) Folia Microbiol. 16: 346-349) and S. cerevisiae ACA 174 (ATCC 60868) (Benitez T et al. (1983) Appl. Environ. Microbiol. 45: 1429-1436; Chem. Eng. J. 50: B17-B22, 1992), which have been shown to have desirable traits for large-scale fermentation.

4.15.10 Choice of Ethanol Tolerant Strains

Many strains of S. cerevisiae have been isolated from high-ethanol environments, and have survived in the ethanol-rich environment by adaptive evolution. For example, strains from Sherry wine aging (“Flor” strains) have evolved highly functional mitochondria to enable their survival in a high-ethanol environment. It has been shown that transfer of these wine yeast mitochondria to other strains increases the recipient's resistance to high ethanol concentration, as well as thermotolerance (Jimenez, J. and Benitez, T (1988) Curr. Genet. 13: 461-469). There are several flor strains deposited in the ATCC, for example S. cerevisiae MY91 (ATCC 201301), MY138 (ATCC 201302), C5 (ATCC 201298), ET7 (ATCC 201299), LA6 (ATCC 201300), OSB21 (ATCC 201303), F23 (S. globosus ATCC 90920). Also, several flor strains of S. uvarum and Torulaspora pretoriensis have been deposited. Other ethanol-tolerant wine strains include S. cerevisiae ACA 174 (ATCC 60868), 15% ethanol, and S. cerevisiae A54 (ATCC 90921), isolated from wine containing 18% (v/v) ethanol, and NRCC 202036 (ATCC 46534), also a wine yeast. Other S. cerevisiae ethanologens that additionally exhibit enhanced ethanol tolerance include ATCC 24858, G 3706 (ATCC 42594), NRRL Y-265 (ATCC 60593), and ATCC 24845-ATCC 24860. A strain of S. pastorianus (S. carlsbergensis ATCC 2345) has high ethanol tolerance (13% v/v). S. cerevisiae Sa28 (ATCC 26603), from a Jamaican cane juice sample, produces high levels of alcohol from molasses, is sugar tolerant, and produces ethanol from wood acid hydrolyzate. Several of the listed strains, as well as additional strains, can be used as starting materials for breeding ethanol tolerance.

4.15.11 Choice of Temperature Tolerant Strains

A few temperature tolerant strains have been reported, including the highly flocculent strain S. pastorianus SA 23 (S. carlsbergensis ATCC 26602), which produces ethanol at elevated temperatures, and S. cerevisiae Kyokai 7 (S. sake, ATCC 26422), a sake yeast tolerant to brief heat and oxidative stress. Ballesteros et al ((1991) Appl. Biochem. Biotechnol. 28/29: 307-315) examined 27 strains of yeast for their ability to grow and ferment glucose in the 32-45° C. temperature range, including Saccharomyces, Kluyveromyces and Candida spp. Of these, the best thermotolerant clones were Kluyveromyces marxianus LG and Kluyveromyces fragilis 2671 (Ballesteros et al (1993) Appl. Biochem. Biotechnol. 39/40: 201-211). S. cerevisiae-pretoriensis FDHI was somewhat thermotolerant, but was poor in ethanol tolerance. Recursive recombination of this strain with others that display ethanol tolerance can be used to acquire the thermotolerant characteristics of the strain in progeny which also display ethanol tolerance. Candida acidothermophilum (Issatchenkia orientalis, ATCC 20381) is a good SSF strain that also exhibits improved performance in ethanol production from lignocellulosic biomass at higher SSF temperatures than S. cerevisiae D₅A (Kadam, K L, Schmidt, S L (1997) Appl. Microbiol. Biotechnol. 48: 709-713). This strain can also be a genetic contributor to an improved SSF strain.

4.15.12 Stochastic &/or Non-Stochastic Mutagenesis of Strains

In those instances where strains are highly related, a recursive mating strategy may be pursued. For example, a population of haploid S. cerevisiae (a and α) is mutagenized and screened for improved EtOH or thermal tolerance. The improved haploid subpopulation is mixed together, mated as a pool and induced to sporulate. The resulting haploid spores are freed by degrading the ascus wall and mixed. The freed spores are then induced to mate and sporulate recursively. This process is repeated a sufficient number of times to generate all possible mutant combinations. The whole genome stochastic &/or non-stochastic mutagenized population (haploid) is then screened for further EtOH or thermal tolerance.
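The way repeated pooled mating and sporulation combines independent mutations can be sketched as a small simulation. This is an illustrative model only: it assumes unlinked loci (each locus of a spore is drawn independently from either parent), ignores mating type, and keeps the pool size constant; the function names and the 0/1 genotype encoding are hypothetical.

```python
import random


def mate_and_sporulate(parent_a, parent_b, rng):
    """Form a diploid from two haploids and return one haploid spore.

    Assumption: loci are unlinked, so every locus is inherited
    independently from one parent or the other.
    """
    return tuple(rng.choice(pair) for pair in zip(parent_a, parent_b))


def recursive_mating(pool, cycles, seed=0):
    """Run the given number of pooled mating/sporulation cycles."""
    rng = random.Random(seed)
    pool = list(pool)
    for _ in range(cycles):
        shuffled = pool[:]
        rng.shuffle(shuffled)  # pooled mating: partners are paired at random
        next_pool = []
        for i in range(0, len(shuffled) - 1, 2):
            a, b = shuffled[i], shuffled[i + 1]
            # Two spores per mating keep the pool size constant.
            next_pool.append(mate_and_sporulate(a, b, rng))
            next_pool.append(mate_and_sporulate(a, b, rng))
        if len(shuffled) % 2:
            next_pool.append(shuffled[-1])  # an unpaired haploid carries over
        pool = next_pool
    return pool


if __name__ == "__main__":
    # Four single mutants (1 = mutation present), five copies each.
    singles = [(1, 0, 0, 0), (0, 1, 0, 0), (0, 0, 1, 0), (0, 0, 0, 1)]
    progeny = recursive_mating(singles * 5, cycles=6, seed=1)
    print(len(set(progeny)), "distinct genotypes after 6 cycles")
```

Because this toy pool is small and spores are sampled at random, some alleles can be lost by drift; the sketch illustrates how combinations arise, not how many cycles a real population requires.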

When strains are not sufficiently related for recursive mating, formats based on protoplast fusion may be employed. Recursive and poolwise protoplast fusion can be performed to generate chimeric populations of diverse parental strains. The resultant pool of progeny is selected and screened to identify improved ethanol and thermal tolerant strains.

Alternatively, a YAC-based whole genome stochastic &/or non-stochastic mutagenesis format can be used. In this format, YACs are used to shuttle large chromosomal fragments between strains. As detailed earlier, recombination occurs between YACs or between YACs and the host chromosomes. Genomic DNA from organisms having desired phenotypes is pooled, fragmented and cloned into several different YAC vectors, each having a different selective marker (his, ura, ade, etc). S. cerevisiae are transformed with these libraries, selected for their presence (using selective media, e.g. uracil dropout media for the YAC containing the ura3 selective marker) and then screened for having acquired or improved a desired phenotype. Surviving cells are pooled, mated recursively (as above), and selected for the accumulation of multiple YACs (by propagation in medium with multiple nutritional dropouts). Cells that acquire multiple YACs harboring useful genomic inserts are identified by further screening (see below).

4.15.13 Selection for Improved Strains

Having produced large libraries of novel strains by mutagenesis and recombination, a first task is to isolate those strains that possess improvements in the desired phenotypes. Identification of improved strains within the organism libraries is facilitated where the desired key traits are selectable phenotypes. For example, ethanol has different effects on the growth rate of a yeast population, viability, and fermentation rate. Inhibition of cell growth and viability increases with ethanol concentration, but high fermentative capacity is only inhibited at higher ethanol concentrations. Hence, selection of growing cells in ethanol is a viable approach to isolate ethanol-tolerant strains. Subsequently, the selected strains may be analyzed for their fermentative capacity to produce ethanol. Provided that growth and media conditions are the same for all strains (parents and progeny), a hierarchy of ethanol tolerance may be constructed. Simple selection schemes for identification of thermal tolerant and ethanol tolerant strains are available and, in this case, are based on those previously designed to identify potentially useful SSF strains. Selection of ethanol tolerance is performed by exposing the population to ethanol, then plating the population and looking for growth. Colonies capable of growing after exposure to ethanol can be re-exposed to a higher concentration of ethanol and the cycle repeated until the most tolerant strains are selected. In order to discern strains possessing heritable ethanol tolerance from those with temporarily acquired adaptations, these cycles may be punctuated with cycles of growth in the absence of selection (e.g. no ethanol).

Alternatively, the mixed population can be grown directly at increasing concentrations of ethanol, and the most tolerant strains enriched (Aguilera and Benitez, 1986, Arch Microbiol 4: 337-44). For example, this enrichment could be carried out in a chemostat or turbidostat. Similar selections can be developed for thermal tolerance, in which strains are identified by their ability to grow after a heat treatment, or directly for growth at elevated temperatures (Ballesteros et al., 1991, Applied Biochem and Biotech, 28: 307-315). The best strains identified by these selections will be assayed more thoroughly in subsequent screens for ethanol tolerance, thermal tolerance or other properties of interest.

In one aspect, organisms having increased ethanol tolerance are selected for. A population of natural S. cerevisiae isolates is mutagenized. This population is then grown under fermentor conditions under low initial ethanol concentrations. Once the culture has reached saturation, the culture is diluted into fresh medium having a slightly higher ethanol content. This process of successive dilution into medium of incrementally increasing ethanol concentration is continued until a threshold of ethanol tolerance is reached. The surviving mutants having the highest ethanol tolerance are then pooled and their genomes recombined by any method noted herein. Enrichment could also be achieved by a continuous culture in a chemostat or turbidostat in which temperature or ethanol concentrations are progressively elevated. The resulting stochastic &/or non-stochastic mutagenized population is then exposed once again to the enrichment strategy but at a higher starting medium ethanol concentration. This strategy is optionally applied for the enrichment of thermotolerant cells and for the enrichment of cells having combined thermo- and ethanol tolerance.
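The successive-dilution enrichment described above can be sketched as a simple loop. This is a deliberately idealized model: each strain is assumed to have a fixed maximum ethanol concentration at which it still grows, expressed in whole percent (v/v), and the strain names, tolerance values, and function name are hypothetical.

```python
def enrich(tolerances, start, step):
    """Serially passage a mixed culture into medium of rising ethanol content.

    tolerances: dict mapping strain name -> highest ethanol % (v/v) at which
                the strain still grows (hypothetical values).
    start:      ethanol % of the first culture.
    step:       increment applied at each dilution into fresh medium.

    Returns (survivors, level), where level is the highest ethanol
    concentration at which at least one strain still grew. If nothing
    grows at the starting concentration, survivors is empty.
    """
    survivors = {s: t for s, t in tolerances.items() if t >= start}
    level = start
    while survivors:
        next_level = level + step
        still_growing = {s: t for s, t in survivors.items() if t >= next_level}
        if not still_growing:
            break  # threshold of ethanol tolerance reached
        survivors, level = still_growing, next_level
    return survivors, level


if __name__ == "__main__":
    mutants = {"A": 8, "B": 11, "C": 11, "D": 9}
    pooled, threshold = enrich(mutants, start=6, step=1)
    print(sorted(pooled), threshold)
```

The survivors returned here correspond to the subpopulation that would be pooled and recombined before the next round, which would then begin at a higher starting concentration.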

4.15.14 Screening for Improved Strains

Strains showing viability in initial selections are assayed more quantitatively for improvements in the desired properties before being stochastic &/or non-stochastic mutagenized again with other strains.

Progeny resulting from mutagenesis of a strain, or those pre-selected for their ethanol tolerance and/or thermostability, can be plated on non-selective agar. Colonies can be picked robotically into microtiter dishes and grown. Cultures are replicated to fresh microtiter plates, and the replicates are incubated under the appropriate stress condition(s). The growth or metabolic activity of individual clones may be monitored and ranked. Indicators of viability can range from the size of growing colonies on solid media, density of growing cultures, or color change of a metabolic activity indicator added to liquid media. Strains that show the greatest viability are then mixed and stochastic &/or non-stochastic mutagenized, and the resulting progeny are rescreened under more stringent conditions.
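The ranking step can be illustrated with a short sketch. It assumes, purely for illustration, that viability is scored as the culture density of the stressed replicate divided by that of the unstressed replicate; the text does not prescribe this metric, and the clone names, readings, and function name are hypothetical.

```python
def rank_clones(readings, top_n):
    """Rank clones by relative growth under stress and return the best ones.

    readings: dict mapping clone name -> (control_density, stress_density),
              e.g. optical densities from replicate microtiter wells.
    top_n:    number of clones to carry forward.
    """
    scores = {}
    for clone, (control_density, stress_density) in readings.items():
        if control_density <= 0:
            continue  # no growth in the control well: the clone cannot be scored
        scores[clone] = stress_density / control_density
    ranked = sorted(scores, key=lambda clone: scores[clone], reverse=True)
    return ranked[:top_n]


if __name__ == "__main__":
    plate = {
        "c1": (1.0, 0.2),
        "c2": (0.8, 0.6),
        "c3": (1.2, 0.6),
        "c4": (0.0, 0.0),
    }
    print(rank_clones(plate, top_n=2))
```

Normalizing to the unstressed replicate is one design choice among several; ranking on raw stressed density would instead favor clones that simply grow faster overall.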

4.15.15 Development of a Yeast Strain Capable of Converting Cellulose to Monomeric Sugars

Once a strain of yeast exhibiting thermotolerance and ethanol tolerance is developed, the degradation of cellulose to monomeric sugars is provided by the inclusion in the host strain of an efficient cellulose degradation pathway.

Additional desirable characteristics can be useful to enhance the production of ethanol by the host. For example, inclusion of heterologous enzymes and pathways that broaden the substrate sugar range may be performed. “Tuning” of the strain can be accomplished by the addition of various other traits, or the restoration of certain endogenous traits that are desirable but were lost during the recombination procedures.

4.15.16 Conferring of Cellulase Activity

A vast number of cellulases and cellulase degradation systems have been characterized from fungi, bacteria and yeast (see reviews by Beguin, P and Aubert, J-P (1994) FEMS Microbiol. Rev. 13: 25-58; Ohmiya, K. et al. (1997) Biotechnol. Genet. Eng. Rev. 14: 365-414). An enzymatic pathway required for efficient saccharification of cellulose involves the synergistic action of endoglucanases (endo-1,4-β-D-glucanases, EC 3.2.1.4), exocellobiohydrolases (exo-1,4-β-D-glucanases, EC 3.2.1.91), and β-glucosidases (cellobiases, 1,4-β-D-glucanases, EC 3.2.1.21). The heterologous production of cellulase enzymes in the ethanologen would enable the saccharification of cellulose, producing monomeric sugars that may be used by the organism for ethanol production. There are several advantages to the heterologous expression of a functional cellulase pathway in the ethanologen. For example, the SSF process would eliminate the need for a separate bioprocess step for saccharification, and would ameliorate end-product inhibition of cellulase enzymes by accumulated intermediate and product sugars.

Naturally occurring cellulase pathways are inserted into the ethanologen, or one may choose to use custom improved “hybrid” cellulase pathways, employing the coordinate action of cellulases derived from different natural sources, including thermophiles.

Several cellulases from non-Saccharomyces organisms have been produced in and secreted from Saccharomyces successfully, including bacterial, fungal, and yeast enzymes, for example T. reesei CBH I (Shoemaker (1994), in “The Cellulase System of Trichoderma reesei: Trichoderma strain improvement and Expression of Trichoderma cellulases in Yeast,” Online, Pinner, UK, 593-600). It is possible to employ straightforward metabolic engineering techniques to engender cellulase activity in Saccharomyces. Also, yeast have been forced to acquire elements of cellulose degradation pathways by protoplast fusion (e.g. intergeneric hybrids of Saccharomyces cerevisiae and Zygosaccharomyces fermentati, a cellobiase-producing yeast, have been created (Pina A, et al. (1986) Appl. Environ. Microbiol. 51: 995-1003)). In general, any cellulase component enzyme that derives from a closely related yeast organism could be transferred by protoplast fusion. Cellobiases produced by a somewhat broader range of yeast may be accessed by whole genome stochastic &/or non-stochastic mutagenesis in one of its many formats (e.g. whole, fragmented, YAC-based).

Optimally, the cellulase enzymes to be used should exhibit good synergy, an appropriate level of expression and secretion from the host, good specific activity (i.e. resistance to host degradation factors and enzyme modification) and stability in the desired SSF environment. An example of a hybrid cellulose degradation pathway having excellent synergy includes the following enzymes: CBH I exocellobiohydrolase of Trichoderma reesei, the Acidothermus cellulolyticus E1 endoglucanase, and the Thermomonospora fusca E3 exocellulase (Baker, et al. (1998) Appl. Biochem. Biotechnol. 70-72: 395-403). It is suggested here that these enzymes (or improved mutants thereof) be considered for use in the SSF organism, along with a cellobiase (β-glucosidase), such as that from Candida peltata. Other possible cellulase systems to be considered should possess particularly good activity against crystalline cellulose, such as the T. reesei cellulase system (Teeri, T T, et al. (1998) Biochem. Soc. Trans. 26: 173-178), or possess particularly good thermostability characteristics (e.g. cellulase systems from thermophilic organisms, such as Thermomonospora fusca (Zhang, S., et al. (1995) Biochem. 34: 3386-3395)).

A rational approach to the cloning of cellulases in the ethanologenic yeast host could be used. For example, known cellulase genes are cloned into expression cassettes utilizing S. cerevisiae promoter sequences, and the resultant linear fragments of DNA may be transformed into the recipient host by placing short yeast sequences at the termini to encourage site-specific integration into the genome. This is preferred to plasmidic transformation for reasons of genetic stability and maintenance of the transforming DNA.

If an entire cellulose degradative pathway were introduced, a selection could be implemented in an agar-plate-based format, and a large number of clones could be assayed for cellulase activity in a short period of time. For example, selection for an exocellulase may be accessible by providing a soluble oligocellulose substrate or carboxymethylcellulose (CMC) as a sole carbon source to the host, which is otherwise unable to grow on agar containing this sole carbon source. Clones producing active cellulase pathways would grow by virtue of their ability to produce glucose.

Alternatively, if the different cellulases were to be introduced sequentially, it would be useful to first introduce a cellobiase, enabling a selection using commercially available cellobiose as a sole carbon source. Several strains of S. cerevisiae that are able to grow on cellobiose have been created by introduction of a cellobiase gene (e.g. Rajoka M I, et al. (1998) Folia Microbiol. (Praha) 43, 129-135; Skory, C D, et al. (1996) Curr. Genet. 30, 417-422; D'Auria, S, et al. (1996) Appl. Biochem. Biotechnol. 61, 157-166; Adam, A C, et al. (1995) Yeast 11, 395-406; Adam, A C (1991) Curr. Genet. 20, 5-8).

Subsequent transformation of this organism with CBH I exocellulase can be selected for by growth on a cellulose substrate such as carboxymethylcellulose (CMC). Finally, addition of an endoglucanase creates a yeast strain with improved crystalline cellulose degradation capacity.

4.15.17 Conferring of Pentose Sugar Utilization

Inclusion of pentose sugar utilization pathways is an important facet of a potentially useful SSF organism. The successful expression of xylose sugar utilization pathways for ethanol production has been reported in Saccharomyces (e.g. Chen, Z D and Ho, N W Y (1993) Appl. Biochem. Biotechnol. 39/40: 135-147).

It would also be useful to accomplish L-arabinose substrate utilization for ethanol production in the Saccharomyces host. Yeast strains that utilize L-arabinose include some Candida and Pichia spp. (McMillan J D and Boynton B L (1994) Appl. Biochem. Biotechnol. 45/46: 569-584; Dien B S, et al. (1996) Appl. Biochem. Biotechnol. 57-58: 233-242). Genes necessary for arabinose fermentation in E. coli could also be introduced by rational means (e.g. as has been performed previously in Z. mobilis (Deanda K, et al. (1996) Appl. Environ. Microbiol. 62: 4465-4470)).

4.15.18 Conferring of Other Useful Activities

Several other traits that are important for optimization of an SSF strain have been shown to be transferable to S. cerevisiae. Like thermal tolerance, cellulase activity and pentose sugar utilization, these traits may not normally be exhibited by Saccharomyces (or the particular strain of Saccharomyces being used as a host), and may be added by genetic means.

For example, expression of human muscle acylphosphatase in S. cerevisiae has been suggested to increase ethanol production (Raugei, G., et al. (1996) Biotechnol. Appl. Biochem. 23: 273-278).

It can occur that evolved stress-tolerant SSF strains acquire some undesirable mutations in the course of the evolution strategy. Indeed, this is a pervasive problem in strain improvement strategies that rely on mutagenesis techniques, and can result in highly unstable or fragile production strains. It is possible to restore some of the lost desirable traits by rational methods such as cloning of specific genes that have been knocked out or negatively influenced in the previous rounds of strain improvement. The advantage of this approach is specificity: the offending gene may be targeted directly. The disadvantage is that it may be time-consuming and repetitious if several genes have been compromised, and it only addresses problems that have been characterized. A preferred (and more traditional) approach to the removal of undesirable/deleterious mutations is to back-cross the evolved strain to a desirable parent strain (e.g. the original “host” SSF strain). This strategy has been employed successfully throughout strain improvement where accessible (i.e. for organisms that have sexual cycles of reproduction). Where a sexual process is lacking, it has been accomplished by using other methods, such as parasexual recombination or protoplast fusion.

For example, the ability to flocculate was conferred on a non-flocculating strain of S. cerevisiae by protoplast fusion with a flocculation competent S. cerevisiae (Watari, J., et al. (1990) Agric. Biol. Chem. 54: 1677-1681).

4.16 Method of In Vivo and In Vitro DNA Shuffling

4.16.1 Applications

Disclosed is a method of producing random polynucleotides by introducing two or more related polynucleotides into a suitable host cell such that a hybrid polynucleotide is generated by recombination and reductive reassortment. Also provided are vector and expression vehicles including such polynucleotides, polypeptides expressed by the hybrid polynucleotides and a method for screening for hybrid polypeptides.

4.16.2 Experimental Applications

This invention relates generally to recombination and more specifically to a method for preparing polynucleotides encoding a polypeptide by a method of in vivo re-assortment of polynucleotide sequences containing regions of partial homology, assembling the polynucleotides to form at least one polynucleotide and screening the polynucleotides for the production of polypeptide(s) having a useful property.

4.16.3 History

An exceedingly large number of possibilities exist for purposeful and random combinations of amino acids within a protein to produce useful hybrid proteins and the corresponding biological molecules encoding these hybrid proteins, i.e., DNA, RNA. Accordingly, there is a need to produce and screen a wide variety of such hybrid proteins for a useful utility, particularly widely varying random proteins.

The complexity of an active sequence of a biological macromolecule (e.g., proteins, DNA) has been called its information content (“IC”), which has been defined as the resistance of the active protein to amino acid sequence variation (calculated from the minimum number of invariable amino acids (bits) required to describe a family of related sequences with the same function). Proteins that are more sensitive to random mutagenesis have a high information content.

Molecular biology developments, such as molecular libraries, have allowed the identification of quite a large number of variable bases, and even provide ways to select functional sequences from random libraries. In such libraries, most residues can be varied (although typically not all at the same time) depending on compensating changes in the context. Thus, while a 100 amino acid protein can contain only 2,000 different mutations, 20¹⁰⁰ sequence combinations are possible.

Information density is the IC per unit length of a sequence. Active sites of enzymes tend to have a high information density. By contrast, flexible linkers in enzymes have a low information density.

Current methods in widespread use for creating alternative proteins in a library format are error-prone polymerase chain reactions and cassette mutagenesis, in which the specific region to be optimized is replaced with a synthetically mutagenized oligonucleotide. In both cases, a substantial number of mutant sites are generated around certain sites in the original sequence.

4.16.3.1 Error-Prone PCR

Error-prone PCR uses low-fidelity polymerization conditions to introduce a low level of point mutations randomly over a long sequence. In a mixture of fragments of unknown sequence, error-prone PCR can be used to mutagenize the mixture. The published error-prone PCR protocols suffer from a low processivity of the polymerase. Therefore, the protocol is unable to result in the random mutagenesis of an average-sized gene. This inability limits the practical application of error-prone PCR. Some computer simulations have suggested that point mutagenesis alone may often be too gradual to allow the large-scale block changes that are required for continued and dramatic sequence evolution. Further, the published error-prone PCR protocols do not allow for amplification of DNA fragments greater than 0.5 to 1.0 kb, limiting their practical application. In addition, repeated cycles of error-prone PCR can lead to an accumulation of neutral mutations with undesired results, such as affecting a protein's immunogenicity but not its binding affinity.

4.16.3.2 Oligonucleotide-Directed Mutagenesis

In oligonucleotide-directed mutagenesis, a short sequence is replaced with a synthetically mutagenized oligonucleotide. This approach does not generate combinations of distant mutations and is thus not combinatorial. The limited library size relative to the vast sequence length means that many rounds of selection are unavoidable for protein optimization. Mutagenesis with synthetic oligonucleotides requires sequencing of individual clones after each selection round followed by grouping them into families, arbitrarily choosing a single family, and reducing it to a consensus motif. Such a motif is resynthesized and reinserted into a single gene followed by additional selection. This stepwise process constitutes a statistical bottleneck, is labor intensive, and is not practical for many rounds of mutagenesis.

Error-prone PCR and oligonucleotide-directed mutagenesis are thus useful for single cycles of sequence fine tuning, but rapidly become too limiting when they are applied for multiple cycles.

Another limitation of error-prone PCR is that the rate of down-mutations grows with the information content of the sequence. As the information content, library size, and mutagenesis rate increase, the balance of down-mutations to up-mutations will statistically prevent the selection of further improvements (statistical ceiling).

4.16.3.3 Cassette Mutagenesis

In cassette mutagenesis, a sequence block of a single template is typically replaced by a (partially) randomized sequence. Therefore, the maximum information content that can be obtained is statistically limited by the number of random sequences (i.e., library size). This eliminates other sequence families which are not currently best, but which may have greater long term potential.

Also, mutagenesis with synthetic oligonucleotides requires sequencing of individual clones after each selection round. Thus, such an approach is tedious and impractical for many rounds of mutagenesis.

Thus, error-prone PCR and cassette mutagenesis are best suited, and have been widely used, for fine-tuning areas of comparatively low information content. One apparent exception is the selection of an RNA ligase ribozyme from a random library using many rounds of amplification by error-prone PCR and selection.

In nature, the evolution of most organisms occurs by natural selection and sexual reproduction. Sexual reproduction ensures mixing and combining of the genes in the offspring of the selected individuals. During meiosis, homologous chromosomes from the parents line up with one another and cross over part way along their length, thus randomly swapping genetic material. Such swapping or shuffling of the DNA allows organisms to evolve more rapidly.

In recombination, because the inserted sequences were of proven utility in a homologous environment, the inserted sequences are likely to still have substantial information content once they are inserted into the new sequence.

4.16.3.4 Applied Molecular Evolution

The term Applied Molecular Evolution (“AME”) means the application of an evolutionary design algorithm to a specific, useful goal. While many different library formats for AME have been reported for polynucleotides, peptides and proteins (phage, lacI and polysomes), none of these formats have provided for recombination by random crossovers to deliberately create a combinatorial library.

Theoretically there are 2,000 different single mutants of a 100 amino acid protein. However, a protein of 100 amino acids has 20¹⁰⁰ possible sequence combinations, a number which is too large to exhaustively explore by conventional methods. It would be advantageous to develop a system which would allow generation and screening of all of these possible combination mutations.
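The two figures quoted above can be checked with a few lines of arithmetic. The sketch below assumes the 2,000 figure counts 20 possible residues at each of 100 positions; a stricter count that excludes the wild-type residue at each position is shown alongside it for comparison and is not a figure taken from the text.

```python
AMINO_ACIDS = 20   # standard amino acid alphabet
LENGTH = 100       # residues in the example protein

# Counting convention assumed for the text's figure: 20 choices at each of 100 positions.
single_mutants_as_stated = AMINO_ACIDS * LENGTH

# Stricter convention: only the 19 non-wild-type residues count as mutants.
single_mutants_strict = (AMINO_ACIDS - 1) * LENGTH

# Full sequence space: every position varies independently.
total_sequences = AMINO_ACIDS ** LENGTH

print(single_mutants_as_stated)    # 2000
print(single_mutants_strict)       # 1900
print(len(str(total_sequences)))   # 131 decimal digits, i.e. roughly 1.27e130
```

The gap between a few thousand single mutants and a sequence space of about 10¹³⁰ is what makes exhaustive screening by conventional methods impractical.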

4.16.3.5 Reported In Vivo Recombination Systems

Some workers in the art have utilized an in vivo site specific recombination system to generate hybrids combining light chain antibody genes with heavy chain antibody genes for expression in a phage system. However, their system relies on specific sites of recombination and is limited accordingly. Simultaneous mutagenesis of antibody CDR regions in single chain antibodies (scFv) by overlapping extension and PCR has been reported.

Others have described a method for generating a large population of multiple hybrids using random in vivo recombination. This method requires the recombination of two different libraries of plasmids, each library having a different selectable marker. The method is limited to a finite number of recombinations equal to the number of selectable markers existing, and produces a concomitant linear increase in the number of marker genes linked to the selected sequence(s).

In vivo recombination between two homologous, but truncated, insect-toxin genes on a plasmid has been reported as a method of producing a hybrid gene. The in vivo recombination of substantially mismatched DNA sequences in a host cell having defective mismatch repair enzymes, resulting in hybrid molecule formation, has been reported.

4.16.4 Strategies

In one aspect this invention provides a method that utilizes the natural property of cells to recombine molecules and/or to mediate reductive processes that reduce the complexity of sequences and extent of repeated or consecutive sequences possessing regions of homology.

It is an object of the present invention to provide a method for generating hybrid polynucleotides encoding biologically active hybrid polypeptides with enhanced activities. In accomplishing these and other objects, there has been provided, in accordance with one aspect of the invention, a method for introducing polynucleotides into a suitable host cell and growing the host cell under conditions which produce a hybrid polynucleotide.

In another aspect of the invention, the invention provides a method for screening for biologically active hybrid polypeptides encoded by hybrid polynucleotides. The present method allows for the identification of biologically active hybrid polypeptides with enhanced biological activities.

Other objects, features and advantages of the present invention willbecome apparent from the following detailed description. It should beunderstood, however, that the detailed description and the specificexamples, while indicating preferred embodiments of the invention, aregiven by way of illustration only, since various changes andmodifications within the spirit and scope of the invention will becomeapparent to those skilled in the art from this detailed description.

4.16.5 Possible Uses

The invention described herein is directed to the use of repeated cycles of reductive reassortment, recombination and selection which allow for the directed molecular evolution of highly complex linear sequences, such as DNA, RNA or proteins, through recombination.

In vivo shuffling of molecules can be performed utilizing the natural property of cells to recombine multimers. While recombination in vivo has provided the major natural route to molecular diversity, genetic recombination remains a relatively complex process that involves 1) the recognition of homologies; 2) strand cleavage, strand invasion, and metabolic steps leading to the production of recombinant chiasma; and finally 3) the resolution of chiasma into discrete recombined molecules. The formation of the chiasma requires the recognition of homologous sequences.

4.16.5.1 Production of a Hybrid Polynucleotide

In a preferred embodiment, the invention relates to a method for producing a hybrid polynucleotide from at least a first polynucleotide and a second polynucleotide. The present invention can be used to produce a hybrid polynucleotide by introducing at least a first polynucleotide and a second polynucleotide which share at least one region of partial sequence homology into a suitable host cell. The regions of partial sequence homology promote processes which result in sequence reorganization producing a hybrid polynucleotide. The term “hybrid polynucleotide”, as used herein, is any nucleotide sequence which results from the method of the present invention and contains sequence from at least two original polynucleotide sequences. Such hybrid polynucleotides can result from intermolecular recombination events which promote sequence integration between DNA molecules. In addition, such hybrid polynucleotides can result from intramolecular reductive reassortment processes which utilize repeated sequences to alter a nucleotide sequence within a DNA molecule.

The invention provides a means for generating hybrid polynucleotides which may encode biologically active hybrid polypeptides. In one aspect, the original polynucleotides encode biologically active polypeptides. The method of the invention produces new hybrid polypeptides by utilizing cellular processes which integrate the sequence of the original polynucleotides such that the resulting hybrid polynucleotide encodes a polypeptide demonstrating activities derived from the original biologically active polypeptides. For example, the original polynucleotides may encode a particular enzyme from different microorganisms. An enzyme encoded by a first polynucleotide from one organism may, for example, function effectively under a particular environmental condition, e.g., high salinity. An enzyme encoded by a second polynucleotide from a different organism may function effectively under a different environmental condition, such as extremely high temperatures. A hybrid polynucleotide containing sequences from the first and second original polynucleotides may encode an enzyme which exhibits characteristics of both enzymes encoded by the original polynucleotides. Thus, the enzyme encoded by the hybrid polynucleotide may function effectively under environmental conditions shared by each of the enzymes encoded by the first and second polynucleotides, e.g., high salinity and extreme temperatures.
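
The idea that a shared region can serve as the crossover point between two original polynucleotides can be illustrated with a minimal sketch. It is a toy model only: the function names and sequences are hypothetical, and the shared region is taken to be the longest exactly matching stretch, whereas the method described above requires only partial sequence homology and relies on cellular processes rather than string operations.

```python
def longest_shared_region(first: str, second: str) -> str:
    """Longest exact substring common to both sequences (toy stand-in for a homology region)."""
    best = ""
    prev = [0] * (len(second) + 1)
    for i in range(1, len(first) + 1):
        curr = [0] * (len(second) + 1)
        for j in range(1, len(second) + 1):
            if first[i - 1] == second[j - 1]:
                # Extend the match that ended at the previous pair of positions.
                curr[j] = prev[j - 1] + 1
                if curr[j] > len(best):
                    best = first[i - curr[j]:i]
        prev = curr
    return best


def make_hybrid(first: str, second: str) -> str:
    """Join the 5' part of `first` to the 3' part of `second` at the shared region."""
    shared = longest_shared_region(first, second)
    if not shared:
        raise ValueError("sequences share no region; no crossover point available")
    head = first[:first.index(shared)]
    tail = second[second.index(shared) + len(shared):]
    return head + shared + tail


# Hypothetical toy sequences sharing the region "CCGGT".
hybrid = make_hybrid("AAACCGGTTTT", "GGGCCGGTACAC")
print(hybrid)
```

The resulting string contains sequence from both inputs, which is the defining property of a hybrid polynucleotide as the term is used above.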

4.16.5.1.1 Encoded Enzymes

Enzymes encoded by the original polynucleotides of the invention include, but are not limited to: oxidoreductases, transferases, hydrolases, lyases, isomerases and ligases. A hybrid polypeptide resulting from the method of the invention may exhibit specialized enzyme activity not displayed in the original enzymes. For example, following recombination and/or reductive reassortment of polynucleotides encoding hydrolase activities, the resulting hybrid polypeptide encoded by a hybrid polynucleotide can be screened for specialized hydrolase activities obtained from each of the original enzymes, i.e., the type of bond on which the hydrolase acts and the temperature at which the hydrolase functions. Thus, for example, the hydrolase may be screened to ascertain those chemical functionalities which distinguish the hybrid hydrolase from the original hydrolases, such as: (a) amide (peptide bonds), i.e., proteases; (b) ester bonds, i.e., esterases and lipases; (c) acetals, i.e., glycosidases; and, for example, the temperature, pH or salt concentration at which the hybrid polypeptide functions.

4.16.5.1.2 Sources of the Original Polynucleotides

The original polynucleotides may be isolated from individual organisms (“isolates”), collections of organisms that have been grown in defined media (“enrichment cultures”), or, most preferably, uncultivated organisms (“environmental samples”). The use of a culture-independent approach to derive polynucleotides encoding novel bioactivities from environmental samples is most preferable since it allows one to access untapped resources of biodiversity.

“Environmental libraries” are generated from environmental samples and represent the collective genomes of naturally occurring organisms archived in cloning vectors that can be propagated in suitable prokaryotic hosts. Because the cloned DNA is initially extracted directly from environmental samples, the libraries are not limited to the small fraction of prokaryotes that can be grown in pure culture. Additionally, a normalization of the environmental DNA present in these samples could allow more equal representation of the DNA from all of the species present in the original sample. This can dramatically increase the efficiency of finding interesting genes from minor constituents of the sample which may be under-represented by several orders of magnitude compared to the dominant species.
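
The effect of normalization on a minor constituent can be illustrated with a short calculation. The sketch below assumes an idealized normalization that brings every species to equal representation; the abundance figures and function names are hypothetical and are not taken from the specification.

```python
def fractions(abundance):
    """Fraction of the library contributed by each species."""
    total = sum(abundance.values())
    return {species: amount / total for species, amount in abundance.items()}


def normalize(abundance):
    """Idealized normalization: every species ends up equally represented."""
    return {species: 1.0 for species in abundance}


# Hypothetical sample in which the rare species is under-represented
# by four orders of magnitude relative to the dominant species.
sample = {"dominant": 10_000.0, "intermediate": 100.0, "rare": 1.0}

before = fractions(sample)["rare"]
after = fractions(normalize(sample))["rare"]

# Expected number of random clones screened per clone from the rare species.
print(round(1 / before), round(1 / after))
```

Under these assumptions the screening effort needed to encounter a clone from the rare species falls from thousands of clones to a handful, which is the efficiency gain the paragraph above describes.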

For example, gene libraries generated from one or more uncultivated microorganisms are screened for an activity of interest. Potential pathways encoding bioactive molecules of interest are first captured in prokaryotic cells in the form of gene expression libraries. Polynucleotides encoding activities of interest are isolated from such libraries and introduced into a host cell. The host cell is grown under conditions which promote recombination and/or reductive reassortment creating potentially active biomolecules with novel or enhanced activities.

The microorganisms from which the polynucleotide may be prepared include prokaryotic microorganisms, such as Eubacteria and Archaebacteria, and lower eukaryotic microorganisms such as fungi, some algae and protozoa. Polynucleotides may be isolated from environmental samples, in which case the nucleic acid may be recovered without culturing of an organism, or recovered from one or more cultured organisms. In one aspect, such microorganisms may be extremophiles, such as hyperthermophiles, psychrophiles, psychrotrophs, halophiles, barophiles and acidophiles. Polynucleotides encoding enzymes isolated from extremophilic microorganisms are particularly preferred. Such enzymes may function at temperatures above 100° C. in terrestrial hot springs and deep sea thermal vents, at temperatures below 0° C. in arctic waters, in the saturated salt environment of the Dead Sea, at pH values around 0 in coal deposits and geothermal sulfur-rich springs, or at pH values greater than 11 in sewage sludge. For example, several esterases and lipases cloned and expressed from extremophilic organisms show high activity throughout a wide range of temperatures and pHs.

4.16.5.1.3 Suitable Host Cells

Polynucleotides selected and isolated as hereinabove described are introduced into a suitable host cell. A suitable host cell is any cell which is capable of promoting recombination and/or reductive reassortment. The selected polynucleotides are preferably already in a vector which includes appropriate control sequences. The host cell can be a higher eukaryotic cell, such as a mammalian cell, or a lower eukaryotic cell, such as a yeast cell, or preferably, the host cell can be a prokaryotic cell, such as a bacterial cell. Introduction of the construct into the host cell can be effected by calcium phosphate transfection, DEAE-Dextran mediated transfection, or electroporation (Davis, L., Dibner, M., Battey, I., Basic Methods in Molecular Biology, (1986)).

As representative examples of appropriate hosts, there may be mentioned: bacterial cells, such as E. coli, Streptomyces, Salmonella typhimurium; fungal cells, such as yeast; insect cells such as Drosophila S2 and Spodoptera Sf9; animal cells such as CHO, COS or Bowes melanoma; adenoviruses; and plant cells. The selection of an appropriate host is deemed to be within the scope of those skilled in the art from the teachings herein.

4.16.5.1.3.1 Mammalian Cell Culture Systems

With particular reference to various mammalian cell culture systems that can be employed to express recombinant protein, examples of mammalian expression systems include the COS-7 lines of monkey kidney fibroblasts, described by Gluzman, Cell, 23: 175 (1981), and other cell lines capable of expressing a compatible vector, for example, the C127, 3T3, CHO, HeLa and BHK cell lines. Mammalian expression vectors will comprise an origin of replication, a suitable promoter and enhancer, and also any necessary ribosome binding sites, polyadenylation site, splice donor and acceptor sites, transcriptional termination sequences, and 5′ flanking nontranscribed sequences. DNA sequences derived from the SV40 splice and polyadenylation sites may be used to provide the required nontranscribed genetic elements.

Host cells containing the polynucleotides of interest can be cultured in conventional nutrient media modified as appropriate for activating promoters, selecting transformants or amplifying genes. The culture conditions, such as temperature, pH and the like, are those previously used with the host cell selected for expression, and will be apparent to the ordinarily skilled artisan. The clones which are identified as having the specified enzyme activity may then be sequenced to identify the polynucleotide sequence encoding an enzyme having the enhanced activity.

4.16.5.1.4 Generation of Polynucleotides Encoding Biochemical Pathways

In another aspect, it is envisioned that the method of the present invention can be used to generate novel polynucleotides encoding biochemical pathways from one or more operons or gene clusters or portions thereof. For example, bacteria and many eukaryotes have a coordinated mechanism for regulating genes whose products are involved in related processes. The genes are clustered, in structures referred to as “gene clusters,” on a single chromosome and are transcribed together under the control of a single regulatory sequence, including a single promoter which initiates transcription of the entire cluster. Thus, a gene cluster is a group of adjacent genes that are either identical or related, usually as to their function. An example of a biochemical pathway encoded by gene clusters is polyketide biosynthesis. Polyketides are molecules which are an extremely rich source of bioactivities, including antibiotics (such as tetracyclines and erythromycin), anti-cancer agents (daunomycin), immunosuppressants (FK506 and rapamycin), and veterinary products (monensin). Many polyketides (produced by polyketide synthases) are valuable as therapeutic agents. Polyketide synthases are multifunctional enzymes that catalyze the biosynthesis of an enormous variety of carbon chains differing in length and patterns of functionality and cyclization. Polyketide synthase genes fall into gene clusters, and at least one type (designated type I) of polyketide synthase has large size genes and enzymes, complicating genetic manipulation and in vitro studies of these genes/proteins.

The ability to select and combine desired components from a library of polyketides, or fragments thereof, and postpolyketide biosynthesis genes for generation of novel polyketides for study is appealing. The method of the present invention makes it possible to facilitate the production of novel polyketide synthases through intermolecular recombination.

4.16.5.1.5 Gene Cluster DNA

Preferably, gene cluster DNA can be isolated from different organisms and ligated into vectors, particularly vectors containing expression regulatory sequences which can control and regulate the production of a detectable protein or protein-related array activity from the ligated gene clusters. Vectors which have an exceptionally large capacity for exogenous DNA introduction are particularly appropriate for use with such gene clusters and are described by way of example herein to include the f-factor (or fertility factor) of E. coli. This f-factor of E. coli is a plasmid which effects high-frequency transfer of itself during conjugation and is ideal to achieve and stably propagate large DNA fragments, such as gene clusters from mixed microbial samples. Once the gene clusters have been ligated into an appropriate vector, two or more vectors containing different polyketide synthase gene clusters can be introduced into a suitable host cell. Regions of partial sequence homology shared by the gene clusters will promote processes which result in sequence reorganization resulting in a hybrid gene cluster. The novel hybrid gene cluster can then be screened for enhanced activities not found in the original gene clusters.

Therefore, in a preferred embodiment, the present invention relates to a method for producing a biologically active hybrid polypeptide and screening such a polypeptide for enhanced activity by:

-   introducing at least a first polynucleotide in operable linkage and a second polynucleotide in operable linkage, said at least first polynucleotide and second polynucleotide sharing at least one region of partial sequence homology, into a suitable host cell;
-   growing the host cell under conditions which promote sequence reorganization resulting in a hybrid polynucleotide in operable linkage;
-   expressing a hybrid polypeptide encoded by the hybrid polynucleotide;
-   screening the hybrid polypeptide under conditions which promote identification of enhanced biological activity; and
-   isolating a polynucleotide encoding the hybrid polypeptide.

Methods for screening for various enzyme activities are known to those of skill in the art and discussed throughout the present specification. Such methods may be employed when isolating the polypeptides and polynucleotides of the present invention.

The term “isolated” means that material is removed from its original environment (e.g., the natural environment if it is naturally occurring). For example, a naturally-occurring polynucleotide or polypeptide present in a living animal is not isolated, but the same polynucleotide or polypeptide, separated from some or all of the coexisting materials in the natural system, is isolated.

As used herein, the term “operably linked” refers to a linkage of polynucleotide elements in a functional relationship. A nucleic acid is “operably linked” when it is placed into a functional relationship with another nucleic acid sequence. For instance, a promoter or enhancer is operably linked to a coding sequence if it affects the transcription of the coding sequence. Operably linked means that the DNA sequences being linked are typically contiguous and, where necessary to join two protein coding regions, contiguous and in reading frame.

4.16.5.1.6 Expression Vectors

As representative examples of expression vectors which may be used, there may be mentioned viral particles, baculovirus, phage, plasmids, phagemids, cosmids, fosmids, bacterial artificial chromosomes, viral DNA (e.g., vaccinia, adenovirus, fowl pox virus, pseudorabies and derivatives of SV40), P1-based artificial chromosomes, yeast plasmids, yeast artificial chromosomes, and any other vectors specific for specific hosts of interest (such as bacillus, aspergillus and yeast). Thus, for example, the DNA may be included in any one of a variety of expression vectors for expressing a polypeptide. Such vectors include chromosomal, nonchromosomal and synthetic DNA sequences. Large numbers of suitable vectors are known to those of skill in the art, and are commercially available. The following vectors are provided by way of example. Bacterial: pQE vectors (Qiagen), pBluescript plasmids, pNH vectors, lambda-ZAP vectors (Stratagene); ptrc99a, pKK223-3, pDR540, pRIT2T (Pharmacia). Eukaryotic: pXT1, pSG5 (Stratagene), pSVK3, pBPV, pMSG, pSVLSV40 (Pharmacia). However, any other plasmid or other vector may be used as long as it is replicable and viable in the host. Low copy number or high copy number vectors may be employed with the present invention.

A preferred type of vector for use in the present invention contains an f-factor origin of replication. The f-factor (or fertility factor) in E. coli is a plasmid which effects high frequency transfer of itself during conjugation and less frequent transfer of the bacterial chromosome itself. A particularly preferred embodiment is to use cloning vectors referred to as “fosmids” or bacterial artificial chromosome (BAC) vectors. These are derived from the E. coli f-factor, which is able to stably integrate large segments of genomic DNA. When integrated with DNA from a mixed uncultured environmental sample, this makes it possible to achieve large genomic fragments in the form of a stable “environmental DNA library.”

Another preferred type of vector for use in the present invention is a cosmid vector. Cosmid vectors were originally designed to clone and propagate large segments of genomic DNA. Cloning into cosmid vectors is described in detail in Sambrook, et al., Molecular Cloning: A Laboratory Manual, Second Edition, Cold Spring Harbor Laboratory Press, 1989.

4.16.5.1.6.1 Expression Control Sequence

The DNA sequence in the expression vector is operatively linked to an appropriate expression control sequence(s) (promoter) to direct RNA synthesis. Particular named bacterial promoters include lac, lacZ, T3, T7, gpt, lambda P_(R), P_(L) and trp. Eukaryotic promoters include CMV immediate early, HSV thymidine kinase, early and late SV40, LTRs from retrovirus, and mouse metallothionein-1. Selection of the appropriate vector and promoter is well within the level of ordinary skill in the art. The expression vector also contains a ribosome binding site for translation initiation and a transcription terminator. The vector may also include appropriate sequences for amplifying expression. Promoter regions can be selected from any desired gene using CAT (chloramphenicol acetyltransferase) vectors or other vectors with selectable markers.

4.16.5.1.6.2 Selectable Marker Genes

In addition, the expression vectors preferably contain one or more selectable marker genes to provide a phenotypic trait for selection of transformed host cells, such as dihydrofolate reductase or neomycin resistance for eukaryotic cell culture, or such as tetracycline or ampicillin resistance in E. coli.

Generally, recombinant expression vectors will include origins of replication and selectable markers permitting transformation of the host cell, e.g., the ampicillin resistance gene of E. coli and the S. cerevisiae TRP1 gene, and a promoter derived from a highly-expressed gene to direct transcription of a downstream structural sequence. Such promoters can be derived from operons encoding glycolytic enzymes such as 3-phosphoglycerate kinase (PGK), α-factor, acid phosphatase, or heat shock proteins, among others. The heterologous structural sequence is assembled in appropriate phase with translation initiation and termination sequences, and preferably, a leader sequence capable of directing secretion of translated protein into the periplasmic space or extracellular medium.

The cloning strategy permits expression via both vector driven and endogenous promoters; vector promotion may be important with expression of genes whose endogenous promoter will not function in E. coli.

4.16.5.1.7 Insertion Into a Vector or Plasmid

The DNA isolated or derived from microorganisms can preferably be inserted into a vector or a plasmid prior to probing for selected DNA. Such vectors or plasmids are preferably those containing expression regulatory sequences, including promoters, enhancers and the like. Such polynucleotides can be part of a vector and/or a composition and still be isolated, in that such vector or composition is not part of its natural environment. Particularly preferred phage or plasmid and methods for introduction and packaging into them are described in detail in the protocol set forth herein.

The selection of the cloning vector depends upon the approach taken; for example, the vector can be any cloning vector with an adequate capacity for multiply repeated copies of a sequence, or multiple sequences, that can be successfully transformed and selected in a host cell. One example of such a vector is described in “Polycos vectors: a system for packaging filamentous phage and phagemid vectors using lambda phage packaging extracts”, Alting-Mees M A, Short J M, Gene, 1993 Dec. 27, 137: 1, 93-100. Propagation/maintenance can be by an antibiotic resistance carried by the cloning vector. After a period of growth, the naturally abbreviated molecules are recovered and identified by size fractionation on a gel or column, or amplified directly. The cloning vector utilized may contain a selectable gene that is disrupted by the insertion of the lengthy construct. As reductive reassortment progresses, the number of repeated units is reduced and the interrupted gene is again expressed, and hence selection for the processed construct can be applied. The vector may be an expression/selection vector which will allow for the selection of an expressed product possessing desirable biological properties. The insert may be positioned downstream of a functional promoter and the desirable property screened by appropriate means.

4.16.5.1.8 Reductive Reassortment

In vivo reassortment is focused on “inter-molecular” processes collectively referred to as “recombination” which, in bacteria, is generally viewed as a “RecA-dependent” phenomenon. The present invention can rely on recombination processes of a host cell to recombine and re-assort sequences, or on the cell's ability to mediate reductive processes to decrease the complexity of quasi-repeated sequences in the cell by deletion. This process of “reductive reassortment” occurs by an “intra-molecular”, RecA-independent process.

Therefore, in another aspect of the present invention, novel polynucleotides can be generated by the process of reductive reassortment. The method involves the generation of constructs containing consecutive sequences (original encoding sequences), their insertion into an appropriate vector, and their subsequent introduction into an appropriate host cell. The reassortment of the individual molecular identities occurs by combinatorial processes between the consecutive sequences in the construct possessing regions of homology, or between quasi-repeated units. The reassortment process recombines and/or reduces the complexity and extent of the repeated sequences, and results in the production of novel molecular species. Various treatments may be applied to enhance the rate of reassortment. These could include treatment with ultra-violet light, or DNA damaging chemicals, and/or the use of host cell lines displaying enhanced levels of “genetic instability”. Thus the reassortment process may involve homologous recombination or the natural property of quasi-repeated sequences to direct their own evolution.

4.16.5.1.9 Repeated or “Quasi-Repeated” Sequences

Repeated or “quasi-repeated” sequences play a role in genetic instability. In the present invention, “quasi-repeats” are repeats that are not restricted to their original unit structure. Quasi-repeated units can be presented as an array of sequences in a construct; consecutive units of similar sequences. Once ligated, the junctions between the consecutive sequences become essentially invisible and the quasi-repetitive nature of the resulting construct is now continuous at the molecular level. The deletion process the cell performs to reduce the complexity of the resulting construct operates between the quasi-repeated sequences. The quasi-repeated units provide a practically limitless repertoire of templates upon which slippage events can occur. The constructs containing the quasi-repeats thus effectively provide sufficient molecular elasticity that deletion (and potentially insertion) events can occur virtually anywhere within the quasi-repetitive units.

When the quasi-repeated sequences are all ligated in the same orientation, for instance head to tail or vice versa, the cell cannot distinguish individual units. Consequently, the reductive process can occur throughout the sequences. In contrast, when for example, the units are presented head to head, rather than head to tail, the inversion delineates the endpoints of the adjacent unit so that deletion formation will favor the loss of discrete units. Thus, it is preferable with the present method that the sequences are in the same orientation. Random orientation of quasi-repeated sequences will result in the loss of reassortment efficiency, while consistent orientation of the sequences will offer the highest efficiency. However, while having fewer of the contiguous sequences in the same orientation decreases the efficiency, it may still provide sufficient elasticity for the effective recovery of novel molecules. Constructs can be made with the quasi-repeated sequences in the same orientation to allow higher efficiency.
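
One simple way to compare constructs with respect to orientation is to count the junctions at which adjacent units share the same orientation. The sketch below is illustrative only: the `+`/`-` encoding and the function name are hypothetical, and the fraction it returns is a bookkeeping measure, not a quantity defined in this specification or a prediction of reassortment efficiency.

```python
def same_orientation_fraction(orientations: str) -> float:
    """Fraction of junctions whose two flanking units point the same way.

    `orientations` holds one character per unit: '+' for forward, '-' for inverted.
    """
    if len(orientations) < 2:
        return 1.0  # a single unit has no junctions to disrupt
    junctions = list(zip(orientations, orientations[1:]))
    matching = sum(left == right for left, right in junctions)
    return matching / len(junctions)


# A consistently oriented construct versus one containing an inverted unit.
print(same_orientation_fraction("++++"))
print(same_orientation_fraction("++-+"))
```

A construct scoring 1.0 corresponds to the preferred case in which all units are in the same orientation; lower values indicate inversions that mark unit endpoints.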

4.16.5.1.10 Assembly of Sequences in a Head to Tail Orientation

Sequences can be assembled in a head to tail orientation using any of a variety of methods, including the following:

-   a) Primers that include a poly-A head and poly-T tail which when made single-stranded would provide orientation can be utilized. This is accomplished by having the first few bases of the primers made from RNA and hence easily removed by RNase H.
-   b) Primers that include unique restriction cleavage sites can be utilized. Multiple sites, a battery of unique sequences, and repeated synthesis and ligation steps would be required.
-   c) The inner few bases of the primer could be thiolated and an exonuclease used to produce properly tailed molecules.

4.16.5.1.11 The Recovery of the Re-assorted Sequences

The recovery of the re-assorted sequences relies on the identification of cloning vectors with a reduced repetitive index (RI; the average number of quasi-repeated units in a construct, as defined below). The re-assorted encoding sequences can then be recovered by amplification. The products are re-cloned and expressed. The recovery of cloning vectors with reduced RI can be effected by:

-   1) The use of vectors only stably maintained when the construct is reduced in complexity.
-   2) The physical recovery of shortened vectors by physical procedures. In this case, the cloning vector would be recovered using standard plasmid isolation procedures and size fractionated on either an agarose gel, or column with a low molecular weight cut off utilizing standard procedures.
-   3) The recovery of vectors containing interrupted genes which can be selected when insert size decreases.
-   4) The use of direct selection techniques with an expression vector and the appropriate selection.

Encoding sequences (for example, genes) from related organisms may demonstrate a high degree of homology yet encode quite diverse protein products. These types of sequences are particularly useful in the present invention as quasi-repeats. However, while the examples illustrated below demonstrate the reassortment of nearly identical original encoding sequences (quasi-repeats), this process is not limited to such nearly identical repeats.

The following example demonstrates the method of the invention. Encoding nucleic acid sequences (quasi-repeats) derived from three (3) unique species are depicted. Each sequence encodes a protein with a distinct set of properties. Each of the sequences differs by a single or a few base pairs at a unique position in the sequence; these positions are designated “A”, “B” and “C”. The quasi-repeated sequences are separately or collectively amplified and ligated into random assemblies such that all possible permutations and combinations are available in the population of ligated molecules. The number of quasi-repeat units can be controlled by the assembly conditions. The average number of quasi-repeated units in a construct is defined as the repetitive index (RI).
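
The definition of the repetitive index can be made concrete with a small simulation of random assembly. This is a sketch under stated assumptions: units "A", "B" and "C" are drawn with equal probability, and the number of units per construct is drawn uniformly between hypothetical bounds `min_units` and `max_units` that stand in for the assembly conditions. None of these names or distributions comes from the specification.

```python
import random
from statistics import mean

UNITS = ("A", "B", "C")  # the three quasi-repeat units of the example


def assemble(n_constructs, min_units, max_units, rng):
    """Ligate units into random assemblies (hypothetical uniform model)."""
    return [
        [rng.choice(UNITS) for _ in range(rng.randint(min_units, max_units))]
        for _ in range(n_constructs)
    ]


def repetitive_index(constructs):
    """RI: the average number of quasi-repeated units per construct."""
    return mean([len(construct) for construct in constructs])


rng = random.Random(1)  # fixed seed so the run is repeatable
library = assemble(n_constructs=1000, min_units=2, max_units=6, rng=rng)
print(repetitive_index(library))
```

Narrowing or shifting the bounds changes the RI of the starting population, which mirrors the statement that the number of units can be controlled by the assembly conditions.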

Once formed, the constructs may or may not be size fractionated on an agarose gel according to published protocols, inserted into a cloning vector, and transfected into an appropriate host cell. The cells are then propagated and “reductive reassortment” is effected. The rate of the reductive reassortment process may be stimulated by the introduction of DNA damage if desired. Whether the reduction in RI is mediated by deletion formation between repeated sequences by an “intra-molecular” mechanism, or mediated by recombination-like events through “inter-molecular” mechanisms, is immaterial. The end result is a reassortment of the molecules into all possible combinations.
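
The reduction in RI during propagation can be sketched as a toy simulation. The model is an assumption made for illustration: in each generation a construct loses one contiguous run of whole units with a fixed probability `deletion_rate`, whereas the text above notes that deletions may occur virtually anywhere within the quasi-repetitive units and that inter-molecular events may also contribute. All names and parameter values are hypothetical.

```python
import random


def reduce_once(construct, rng):
    """Delete one contiguous run of units, always leaving at least one unit."""
    if len(construct) < 2:
        return list(construct)
    start = rng.randrange(len(construct))
    longest = min(len(construct) - 1, len(construct) - start)
    length = rng.randint(1, longest)
    return construct[:start] + construct[start + length:]


def propagate(constructs, generations, deletion_rate, rng):
    """Return the final constructs and the RI recorded after each generation."""
    history = [sum(len(c) for c in constructs) / len(constructs)]
    for _ in range(generations):
        constructs = [
            reduce_once(c, rng) if rng.random() < deletion_rate else c
            for c in constructs
        ]
        history.append(sum(len(c) for c in constructs) / len(constructs))
    return constructs, history


rng = random.Random(0)  # fixed seed so the run is repeatable
start = [list("ABCABCAB") for _ in range(50)]
final, ri_history = propagate(start, generations=10, deletion_rate=0.3, rng=rng)
print(ri_history[0], ri_history[-1])
```

Raising `deletion_rate` in this sketch plays the role of stimulating the process by introducing DNA damage: the RI falls faster, and the surviving constructs carry new combinations of the original units.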

Optionally, the method comprises the additional step of screening the library members of the shuffled pool to identify individual shuffled library members having the ability to bind or otherwise interact (e.g., such as catalytic antibodies) with a predetermined macromolecule, such as for example a proteinaceous receptor, peptide, oligosaccharide, virion, or other predetermined compound or structure.

The displayed polypeptides, antibodies, peptidomimetic antibodies, and variable region sequences that are identified from such libraries can be used for therapeutic, diagnostic, research and related purposes (e.g., catalysts, solutes for increasing osmolarity of an aqueous solution, and the like), and/or can be subjected to one or more additional cycles of shuffling and/or affinity selection. The method can be modified such that the step of selecting for a phenotypic characteristic can be other than of binding affinity for a predetermined molecule (e.g., for catalytic activity, stability, oxidation resistance, drug resistance, or detectable phenotype conferred upon a host cell).

4.16.5.1.12 Providing Antibodies Suitable for Affinity Interactions Screening

The present invention provides a method for generating libraries of displayed antibodies suitable for affinity interactions screening. The method comprises (1) obtaining a first plurality of selected library members comprising a displayed antibody and an associated polynucleotide encoding said displayed antibody, and obtaining said associated polynucleotides or copies thereof, wherein said associated polynucleotides comprise a region of substantially identical variable region framework sequence, and (2) introducing said polynucleotides into a suitable host cell and growing the cells under conditions which promote recombination and reductive reassortment resulting in shuffled polynucleotides. CDR combinations comprised by the shuffled pool are not present in the first plurality of selected library members, said shuffled pool composing a library of displayed antibodies comprising CDR permutations and suitable for affinity interaction screening. Optionally, the shuffled pool is subjected to affinity screening to select shuffled library members which bind to a predetermined epitope (antigen), thereby selecting a plurality of selected shuffled library members. Further, the plurality of selected shuffled library members can be shuffled and screened iteratively, from 1 to about 1000 cycles or as desired until library members having a desired binding affinity are obtained.
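
The statement that the shuffled pool contains CDR combinations absent from the first plurality of selected library members can be illustrated by enumeration. The sketch assumes, purely for illustration, that each library member is summarized by three CDR labels and that recombination within the shared framework can pair any CDR seen at one position with any CDR seen at another; the labels and the function name are hypothetical.

```python
from itertools import product


def shuffled_pool(selected):
    """All CDR combinations reachable by mixing CDRs position by position."""
    columns = list(zip(*selected))  # one tuple of observed CDRs per position
    return set(product(*(sorted(set(column)) for column in columns)))


# Two hypothetical selected members, each described by (CDR1, CDR2, CDR3).
selected = {("H1a", "H2a", "H3a"), ("H1b", "H2b", "H3b")}

pool = shuffled_pool(selected)
novel = pool - selected  # combinations not present among the selected members
print(len(pool), len(novel))
```

Each round of affinity screening would then keep a subset of `pool` as the new `selected` set before the next cycle of shuffling; the screening criterion itself is outside the scope of this sketch.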

4.16.5.1.13 Introduction of Mutations into the Original Polynucleotides

In another aspect of the invention, it is envisioned that prior to or during recombination or reassortment, polynucleotides generated by the method of the present invention can be subjected to agents or processes which promote the introduction of mutations into the original polynucleotides. The introduction of such mutations would increase the diversity of resulting hybrid polynucleotides and polypeptides encoded therefrom. The agents or processes which promote mutagenesis can include, but are not limited to: (+)-CC-1065, or a synthetic analog such as (+)-CC-1065-(N-3-Adenine) (see Biochem. 31, 2822-2829 (1992)); an N-acetylated or deacetylated 4′-fluoro-4-aminobiphenyl adduct capable of inhibiting DNA synthesis (see, for example, Carcinogenesis vol. 13, No. 5, 751-758 (1992)); or an N-acetylated or deacetylated 4-aminobiphenyl adduct capable of inhibiting DNA synthesis (see also, Id. 751-758); trivalent chromium, a trivalent chromium salt, a polycyclic aromatic hydrocarbon (“PAH”) DNA adduct capable of inhibiting DNA replication, such as 7-bromomethyl-benz[a]anthracene (“BMA”), tris(2,3-dibromopropyl)phosphate (“Tris-BP”), 1,2-dibromo-3-chloropropane (“DBCP”), 2-bromoacrolein (“2BA”), benzo[a]pyrene-7,8-dihydrodiol-9-10-epoxide (“BPDE”), a platinum(II) halogen salt, N-hydroxy-2-amino-3-methylimidazo[4,5-f]-quinoline (“N-hydroxy-IQ”), and N-hydroxy-2-amino-1-methyl-6-phenylimidazo[4,5-f]-pyridine (“N-hydroxy-PhIP”). Especially preferred means for slowing or halting PCR amplification consist of UV light, (+)-CC-1065 and (+)-CC-1065-(N-3-Adenine). Particularly encompassed means are DNA adducts or polynucleotides comprising the DNA adducts from the polynucleotides or polynucleotides pool, which can be released or removed by a process including heating the solution comprising the polynucleotides prior to further processing.

4.16.5.1.14 Production Of Hybrid Or Re-Assorted Polynucleotides

In another aspect the present invention is directed to a method of producing recombinant proteins having biological activity by treating a sample comprising double-stranded template polynucleotides encoding a wild-type protein under conditions according to the present invention which provide for the production of hybrid or re-assorted polynucleotides.

4.16.5.1.15 Shuffling a Population of Viral Genes or Viral Genomes

The invention also provides the use of polynucleotide shuffling to shuffle a population of viral genes (e.g., capsid proteins, spike glycoproteins, polymerases, and proteases) or viral genomes (e.g., paramyxoviridae, orthomyxoviridae, herpesviruses, retroviruses, reoviruses and rhinoviruses). In an embodiment, the invention provides a method for shuffling sequences encoding all or portions of immunogenic viral proteins to generate novel combinations of epitopes as well as novel epitopes created by recombination; such shuffled viral proteins may comprise epitopes or combinations of epitopes which are likely to arise in the natural environment as a consequence of viral evolution (e.g., such as recombination of influenza virus strains).

4.16.5.1.16 Generation of Gene Therapy Vectors and Replication-Defective Gene Therapy Constructs

The invention also provides a method suitable for shuffling polynucleotide sequences for generating gene therapy vectors and replication-defective gene therapy constructs, such as may be used for human gene therapy, including but not limited to vaccination vectors for DNA-based vaccination, as well as anti-neoplastic gene therapy and other general therapy formats.

4.16.5.2 Definitions

The term “DNA shuffling” is used herein to indicate recombination between substantially homologous but non-identical sequences; in some embodiments DNA shuffling may involve crossover via non-homologous recombination, such as via cre/lox and/or flp/frt systems and the like.

The term “amplification” means that the number of copies of a polynucleotide is increased.

The term “identical” or “identity” means that two nucleic acid sequences have the same sequence or a complementary sequence. Thus, “areas of identity” means that regions or areas of a polynucleotide or the overall polynucleotide are identical or complementary to areas of another polynucleotide or the polynucleotide.

The term “corresponds to” is used herein to mean that a polynucleotide sequence is homologous (i.e., is identical, not strictly evolutionarily related) to all or a portion of a reference polynucleotide sequence, or that a polypeptide sequence is identical to a reference polypeptide sequence. In contradistinction, the term “complementary to” is used herein to mean that the complementary sequence is homologous to all or a portion of a reference polynucleotide sequence. For illustration, the nucleotide sequence “TATAC” corresponds to a reference “TATAC” and is complementary to a reference sequence “GTATA.”
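The “TATAC”/“GTATA” illustration can be checked with a short Python sketch. The function names are hypothetical and are introduced here only for illustration; the sketch assumes plain DNA strings written 5′ to 3′ and treats “homologous” in the strict sense of the definition above, i.e., identical.

```python
# Illustrative sketch only; helper names are hypothetical, not part of the specification.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}


def reverse_complement(seq):
    """Return the complementary strand of a DNA string, written 5' to 3'."""
    return "".join(COMPLEMENT[base] for base in reversed(seq.upper()))


def corresponds_to(seq, reference):
    """True if seq is identical to all or a portion of the reference."""
    return seq.upper() in reference.upper()


def complementary_to(seq, reference):
    """True if the complement of seq is identical to all or a portion of the reference."""
    return reverse_complement(seq) in reference.upper()


print(reverse_complement("TATAC"))
print(corresponds_to("TATAC", "TATAC"), complementary_to("TATAC", "GTATA"))
```

Reading the complement in the 5′ to 3′ direction is what makes “GTATA”, rather than “ATATG”, the complementary reference in the illustration.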

Genetic instability, as used herein, refers to the natural tendency of highly repetitive sequences to be lost through a process of reductive events generally involving sequence simplification through the loss of repeated sequences. Deletions tend to involve the loss of one copy of a repeat and everything between the repeats.
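As a toy illustration of such a deletion, the following Python sketch removes one copy of a direct repeat together with everything between the two copies. It is a string-level model only, under the assumption of exact repeats; the function name and the example sequence are hypothetical and do not describe any actual recombination mechanism.

```python
# Toy model of a reductive deletion event; names and sequences are hypothetical.
def delete_between_repeats(seq, repeat):
    """Drop the first copy of `repeat` and everything up to the next copy."""
    first = seq.find(repeat)
    if first == -1:
        return seq  # no repeat present, nothing to delete
    second = seq.find(repeat, first + len(repeat))
    if second == -1:
        return seq  # only one copy present, nothing to delete
    # Keep the flank before the first copy, then resume at the second copy.
    return seq[:first] + seq[second:]


before = "CC" + "GATTACA" + "TTTT" + "GATTACA" + "GG"
after = delete_between_repeats(before, "GATTACA")
print(before, "->", after)
```

The product retains a single copy of the repeat and both outer flanks, which is the sequence simplification referred to above.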

Quasi-repeated units, as used herein, refers to the repeats to be re-assorted and are by definition not identical. Indeed the method is proposed not only for practically identical encoding units produced by mutagenesis of the identical starting sequence, but also the reassortment of similar or related sequences which may diverge significantly in some regions. Nevertheless, if the sequences contain sufficient homologies to be reassorted by this approach, they can be referred to as “quasi-repeated” units.

Reductive reassortment, as used herein, refers to the increase in molecular diversity that is accrued through deletion (and/or insertion) events that are mediated by repeated sequences.

Repetitive Index (RI), as used herein, is the average number of copies of the quasi-repeated units contained in the cloning vector.

The term “related polynucleotides” means that regions or areas of the polynucleotides are identical and regions or areas of the polynucleotides are heterologous.

The term “population” as used herein means a collection of components such as polynucleotides, portions of polynucleotides or proteins. A “mixed population” means a collection of components which belong to the same family of nucleic acids or proteins (i.e., are related) but which differ in their sequence (i.e., are not identical) and hence in their biological activity.

The term “specific polynucleotide” means a polynucleotide having certain end points and having a certain nucleic acid sequence. Two polynucleotides wherein one polynucleotide has the identical sequence as a portion of the second polynucleotide but different ends comprise two different specific polynucleotides.

The following terms are used to describe the sequence relationships between two or more polynucleotides: “reference sequence,” “comparison window,” “sequence identity,” “percentage of sequence identity,” and “substantial identity.” A “reference sequence” is a defined sequence used as a basis for a sequence comparison; a reference sequence may be a subset of a larger sequence, for example, as a segment of a full-length cDNA or gene sequence given in a sequence listing, or may comprise a complete cDNA or gene sequence. Generally, a reference sequence is at least 20 nucleotides in length, frequently at least 25 nucleotides in length, and often at least 50 nucleotides in length. Since two polynucleotides may each (1) comprise a sequence (i.e., a portion of the complete polynucleotide sequence) that is similar between the two polynucleotides and (2) may further comprise a sequence that is divergent between the two polynucleotides, sequence comparisons between two (or more) polynucleotides are typically performed by comparing sequences of the two polynucleotides over a “comparison window” to identify and compare local regions of sequence similarity.

A “comparison window,” as used herein, refers to a conceptual segment of at least 20 contiguous nucleotide positions wherein a polynucleotide sequence may be compared to a reference sequence of at least 20 contiguous nucleotides and wherein the portion of the polynucleotide sequence in the comparison window may comprise additions or deletions (i.e., gaps) of 20 percent or less as compared to the reference sequence (which does not comprise additions or deletions) for optimal alignment of the two sequences. Optimal alignment of sequences for aligning a comparison window may be conducted by the local homology algorithm of Smith and Waterman (1981) Adv. Appl. Math. 2: 482, by the homology alignment algorithm of Needleman and Wunsch J. Mol. Biol. 48: 443 (1970), by the search for similarity method of Pearson and Lipman Proc. Natl. Acad. Sci. (U.S.A.) 85: 2444 (1988), by computerized implementations of these algorithms (GAP, BESTFIT, FASTA, and TFASTA in the Wisconsin Genetics Software Package Release 7.0, Genetics Computer Group, 575 Science Dr., Madison, Wis.), or by inspection, and the best alignment (i.e., resulting in the highest percentage of homology over the comparison window) generated by the various methods is selected. The term “sequence identity” means that two polynucleotide sequences are identical (i.e., on a nucleotide-by-nucleotide basis) over the window of comparison. The term “percentage of sequence identity” is calculated by comparing two optimally aligned sequences over the window of comparison, determining the number of positions at which the identical nucleic acid base (e.g., A, T, C, G, U, or I) occurs in both sequences to yield the number of matched positions, dividing the number of matched positions by the total number of positions in the window of comparison (i.e., the window size), and multiplying the result by 100 to yield the percentage of sequence identity. The term “substantial identity”, as used herein, denotes a characteristic of a polynucleotide sequence, wherein the polynucleotide comprises a sequence having at least 80 percent sequence identity, preferably at least 85 percent identity, often 90 to 95 percent sequence identity, and most commonly at least 99 percent sequence identity as compared to a reference sequence over a comparison window of at least 25-50 nucleotides, wherein the percentage of sequence identity is calculated by comparing the reference sequence to the polynucleotide sequence which may include deletions or additions which total 20 percent or less of the reference sequence over the window of comparison.
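The percentage calculation described above can be sketched in Python as follows. The sketch assumes the two sequences have already been optimally aligned by one of the cited methods and are supplied as equal-length strings in which “-” marks a gap; the function names are hypothetical, and the default thresholds (80 percent, 25 positions) are simply the lowest values recited above.

```python
# Illustrative sketch; assumes pre-aligned, equal-length inputs with "-" as the gap symbol.
def percent_identity(aligned_a, aligned_b):
    """Matched positions divided by window size, multiplied by 100."""
    if len(aligned_a) != len(aligned_b):
        raise ValueError("aligned sequences must have the same length")
    if not aligned_a:
        raise ValueError("comparison window is empty")
    # A position counts as matched only when both sequences carry the same base.
    matches = sum(1 for x, y in zip(aligned_a, aligned_b) if x == y and x != "-")
    return 100.0 * matches / len(aligned_a)


def is_substantially_identical(aligned_a, aligned_b, threshold=80.0, min_window=25):
    """Apply a percent-identity threshold over a sufficiently long window."""
    return (len(aligned_a) >= min_window
            and percent_identity(aligned_a, aligned_b) >= threshold)


reference = "ACGTACGTACGTACGTACGTACGTA"  # 25 positions
query = "TCGTACGTACCTACGTACGTTCGTA"      # differs at 3 of the 25 positions
print(percent_identity(reference, query))
print(is_substantially_identical(reference, query))
```

With 22 of 25 positions matched, the example gives 88 percent identity, which meets the 80 percent threshold but not the more stringent ones.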

“Conservative amino acid substitutions” refer to the interchangeability of residues having similar side chains. For example, a group of amino acids having aliphatic side chains is glycine, alanine, valine, leucine, and isoleucine; a group of amino acids having aliphatic-hydroxyl side chains is serine and threonine; a group of amino acids having amide-containing side chains is asparagine and glutamine; a group of amino acids having aromatic side chains is phenylalanine, tyrosine, and tryptophan; a group of amino acids having basic side chains is lysine, arginine, and histidine; and a group of amino acids having sulfur-containing side chains is cysteine and methionine. Preferred conservative amino acid substitution groups are: valine-leucine-isoleucine, phenylalanine-tyrosine, lysine-arginine, alanine-valine, and asparagine-glutamine.

The term “homologous” or “homeologous” means that one single-stranded nucleic acid sequence may hybridize to a complementary single-stranded nucleic acid sequence. The degree of hybridization may depend on a number of factors including the amount of identity between the sequences and the hybridization conditions such as temperature and salt concentrations as discussed later. Preferably the region of identity is greater than about 5 bp, more preferably the region of identity is greater than 10 bp.

The term “heterologous” means that one single-stranded nucleic acid sequence is unable to hybridize to another single-stranded nucleic acid sequence or its complement. Thus areas of heterology means that areas of polynucleotides or polynucleotides have areas or regions within their sequence which are unable to hybridize to another nucleic acid or polynucleotide. Such regions or areas are, for example, areas of mutations.

The term “cognate” as used herein refers to a gene sequence that is evolutionarily and functionally related between species. For example but not limitation, in the human genome the human CD4 gene is the cognate gene to the mouse CD4 gene, since the sequences and structures of these two genes indicate that they are highly homologous and both genes encode a protein which functions in signaling T cell activation through MHC class II-restricted antigen recognition.

The term “wild-type” means that the polynucleotide does not comprise any mutations. A “wild type” protein means that the protein will be active at a level of activity found in nature and will comprise the amino acid sequence found in nature.

The term “mutations” means changes in the sequence of a wild-type nucleic acid sequence or changes in the sequence of a peptide. Such mutations may be point mutations such as transitions or transversions. The mutations may be deletions, insertions or duplications.

In the polypeptide notation used herein, the left-hand direction is the amino terminal direction and the right-hand direction is the carboxy-terminal direction, in accordance with standard usage and convention. Similarly, unless specified otherwise, the left-hand end of single-stranded polynucleotide sequences is the 5′ end; the left-hand direction of double-stranded polynucleotide sequences is referred to as the 5′ direction. The direction of 5′ to 3′ addition of nascent RNA transcripts is referred to as the transcription direction; sequence regions on the DNA strand having the same sequence as the RNA and which are 5′ to the 5′ end of the RNA transcript are referred to as “upstream sequences”; sequence regions on the DNA strand having the same sequence as the RNA and which are 3′ to the 3′ end of the coding RNA transcript are referred to as “downstream sequences”.

The term “naturally-occurring” as used herein as applied to an object refers to the fact that an object can be found in nature. For example, a polypeptide or polynucleotide sequence that is present in an organism (including viruses) that can be isolated from a source in nature and which has not been intentionally modified by man in the laboratory is naturally occurring. Generally, the term naturally occurring refers to an object as present in a non-pathological (un-diseased) individual, such as would be typical for the species.

The term “agent” is used herein to denote a chemical compound, a mixture of chemical compounds, an array of spatially localized compounds (e.g., a VLSIPS peptide array, polynucleotide array, and/or combinatorial small molecule array), biological macromolecule, a bacteriophage peptide display library, a bacteriophage antibody (e.g., scFv) display library, a polysome peptide display library, or an extract made from biological materials such as bacteria, plants, fungi, or animal (particularly mammalian) cells or tissues. Agents are evaluated for potential activity as anti-neoplastics, anti-inflammatories or apoptosis modulators by inclusion in screening assays described hereinbelow. Agents are evaluated for potential activity as specific protein interaction inhibitors (i.e., an agent which selectively inhibits a binding interaction between two predetermined polypeptides but which does not substantially interfere with cell viability) by inclusion in screening assays described hereinbelow.

As used herein, “substantially pure” means an object species is the predominant species present (i.e., on a molar basis it is more abundant than any other individual macromolecular species in the composition), and preferably a substantially purified fraction is a composition wherein the object species comprises at least about 50 percent (on a molar basis) of all macromolecular species present. Generally, a substantially pure composition will comprise more than about 80 to 90 percent of all macromolecular species present in the composition. Most preferably, the object species is purified to essential homogeneity (contaminant species cannot be detected in the composition by conventional detection methods) wherein the composition consists essentially of a single macromolecular species. Solvent species, small molecules (<500 Daltons), and elemental ion species are not considered macromolecular species.

As used herein the term “physiological conditions” refers to temperature, pH, ionic strength, viscosity, and like biochemical parameters which are compatible with a viable organism, and/or which typically exist intracellularly in a viable cultured yeast cell or mammalian cell. For example, the intracellular conditions in a yeast cell grown under typical laboratory culture conditions are physiological conditions. Suitable in vitro reaction conditions for in vitro transcription cocktails are generally physiological conditions. In general, in vitro physiological conditions comprise 50-200 mM NaCl or KCl, pH 6.5-8.5, 20-45 C and 0.001-10 mM divalent cation (e.g., Mg⁺⁺, Ca⁺⁺); preferably about 150 mM NaCl or KCl, pH 7.2-7.6, 5 mM divalent cation, and often include 0.01-1.0 percent nonspecific protein (e.g., BSA). A non-ionic detergent (Tween, NP-40, Triton X-100) can often be present, usually at about 0.001 to 2%, typically 0.05-0.2% (v/v). Particular aqueous conditions may be selected by the practitioner according to conventional methods. For general guidance, the following buffered aqueous conditions may be applicable: 10-250 mM NaCl, 5-50 mM Tris HCl, pH 5-8, with optional addition of divalent cation(s) and/or metal chelators and/or non-ionic detergents and/or membrane fractions and/or anti-foam agents and/or scintillants.

“Specific hybridization” is defined herein as the formation of hybrids between a first polynucleotide and a second polynucleotide (e.g., a polynucleotide having a distinct but substantially identical sequence to the first polynucleotide), wherein substantially unrelated polynucleotide sequences do not form hybrids in the mixture.

As used herein, the term “single-chain antibody” refers to a polypeptide comprising a V_(H) domain and a V_(L) domain in polypeptide linkage, generally linked via a spacer peptide (e.g., [Gly-Gly-Gly-Gly-Ser]_(x)), and which may comprise additional amino acid sequences at the amino- and/or carboxy-termini. For example, a single-chain antibody may comprise a tether segment for linking to the encoding polynucleotide. As an example, a scFv is a single-chain antibody. Single-chain antibodies are generally proteins consisting of one or more polypeptide segments of at least 10 contiguous amino acids substantially encoded by genes of the immunoglobulin superfamily (e.g., see The Immunoglobulin Gene Superfamily, A. F. Williams and A. N. Barclay, in Immunoglobulin Genes, T. Honjo, F. W. Alt, and T. H. Rabbitts, eds., (1989) Academic Press: San Diego, Calif., pp. 361-368, which is incorporated herein by reference), most frequently encoded by a rodent, non-human primate, avian, porcine, bovine, ovine, goat, or human heavy chain or light chain gene sequence. A functional single-chain antibody generally contains a sufficient portion of an immunoglobulin superfamily gene product so as to retain the property of binding to a specific target molecule, typically a receptor or antigen (epitope).

As used herein, the terms “complementarity-determining region” and “CDR” refer to the art-recognized term as exemplified by the Kabat and Chothia CDR definitions, also generally known as supervariable regions or hypervariable loops (Chothia and Lesk (1987) J. Mol. Biol. 196: 901; Chothia et al. (1989) Nature 342: 877; E. A. Kabat et al., Sequences of Proteins of Immunological Interest (National Institutes of Health, Bethesda, Md.) (1987); and Tramontano et al. (1990) J. Mol. Biol. 215: 175). Variable region domains typically comprise the amino-terminal approximately 105-115 amino acids of a naturally-occurring immunoglobulin chain (e.g., amino acids 1-110), although variable domains somewhat shorter or longer are also suitable for forming single-chain antibodies.

An immunoglobulin light or heavy chain variable region consists of a “framework” region interrupted by three hypervariable regions, also called CDR's. The extent of the framework region and CDR's have been precisely defined (see, “Sequences of Proteins of Immunological Interest,” E. Kabat et al., 4th Ed., U.S. Department of Health and Human Services, Bethesda, Md. (1987)). The sequences of the framework regions of different light or heavy chains are relatively conserved within a species. As used herein, a “human framework region” is a framework region that is substantially identical (about 85 percent or more, usually 90-95 percent or more) to the framework region of a naturally occurring human immunoglobulin. The framework region of an antibody, that is the combined framework regions of the constituent light and heavy chains, serves to position and align the CDR's. The CDR's are primarily responsible for binding to an epitope of an antigen.

As used herein, the term “variable segment” refers to a portion of a nascent peptide which comprises a random, pseudorandom, or defined kernel sequence. A variable segment can comprise both variant and invariant residue positions, and the degree of residue variation at a variant residue position may be limited: both options are selected at the discretion of the practitioner. Typically, variable segments are about 5 to 20 amino acid residues in length (e.g., 8 to 10), although variable segments may be longer and may comprise antibody portions or receptor proteins, such as an antibody fragment, a nucleic acid binding protein, a receptor protein, and the like.

As used herein, “random peptide sequence” refers to an amino acid sequence composed of two or more amino acid monomers and constructed by a stochastic or random process. A random peptide can include framework or scaffolding motifs, which may comprise invariant sequences.

As used herein “random peptide library” refers to a set of polynucleotide sequences that encodes a set of random peptides, and to the set of random peptides encoded by those polynucleotide sequences, as well as the fusion proteins containing those random peptides.

As used herein, the term “pseudorandom” refers to a set of sequences that have limited variability, such that, for example, the degree of residue variability at one position is different than the degree of residue variability at another position, but any pseudorandom position is allowed some degree of residue variation, however circumscribed.

As used herein, the term “defined sequence framework” refers to a set of defined sequences that are selected on a non-random basis, generally on the basis of experimental data or structural data; for example, a defined sequence framework may comprise a set of amino acid sequences that are predicted to form a β-sheet structure or may comprise a leucine zipper heptad repeat motif, a zinc-finger domain, among other variations. A “defined sequence kernel” is a set of sequences which encompass a limited scope of variability. Whereas (1) a completely random 10-mer sequence of the 20 conventional amino acids can be any of (20)¹⁰ sequences, and (2) a pseudorandom 10-mer sequence of the 20 conventional amino acids can be any of (20)¹⁰ sequences but will exhibit a bias for certain residues at certain positions and/or overall, (3) a defined sequence kernel is a subset of the sequences that would be possible if each residue position were allowed to be any of the allowable 20 conventional amino acids (and/or allowable unconventional amino/imino acids). A defined sequence kernel generally comprises variant and invariant residue positions and/or comprises variant residue positions which can comprise a residue selected from a defined subset of amino acid residues, and the like, either segmentally or over the entire length of the individual selected library member sequence. Defined sequence kernels can refer to either amino acid sequences or polynucleotide sequences. By way of illustration and not limitation, the sequences (NNK)₁₀ and (NNM)₁₀, wherein N represents A, T, G, or C; K represents G or T; and M represents A or C, are defined sequence kernels.
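The sizes of these sequence spaces can be worked through with a brief Python sketch. It counts the completely random 10-mer peptide space and enumerates the codons permitted by the degenerate patterns NNK and NNM using the base assignments given above; the helper names are hypothetical and the sketch is illustrative only.

```python
# Illustrative sketch; helper names are hypothetical.
from itertools import product

# Degenerate base symbols as defined in the text above.
AMBIGUITY = {"N": "ATGC", "K": "GT", "M": "AC"}


def fully_random_peptides(length, alphabet_size=20):
    """Number of distinct peptides when every position may be any residue."""
    return alphabet_size ** length


def codons_for_pattern(pattern):
    """Enumerate every concrete codon matching a degenerate pattern such as 'NNK'."""
    return ["".join(bases) for bases in product(*(AMBIGUITY[s] for s in pattern))]


nnk = codons_for_pattern("NNK")
print(fully_random_peptides(10))  # (20)^10 completely random 10-mers
print(len(nnk))                   # codons allowed at each NNK position
print(len(nnk) ** 10)             # polynucleotide sequences matching (NNK)10
```

Each NNK or NNM position admits 4 × 4 × 2 = 32 of the 64 possible codons, so (NNK)₁₀ spans 32¹⁰ polynucleotide sequences, a subset of the 64¹⁰ sequences available to an unrestricted 30-nucleotide stretch.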

As used herein “epitope” refers to that portion of an antigen or other macromolecule capable of forming a binding interaction that interacts with the variable region binding pocket of an antibody. Typically, such binding interaction is manifested as an intermolecular contact with one or more amino acid residues of a CDR.

As used herein, “receptor” refers to a molecule that has an affinity for a given ligand. Receptors can be naturally occurring or synthetic molecules. Receptors can be employed in an unaltered state or as aggregates with other species. Receptors can be attached, covalently or non-covalently, to a binding member, either directly or via a specific binding substance. Examples of receptors include, but are not limited to, antibodies, including monoclonal antibodies and antisera reactive with specific antigenic determinants (such as on viruses, cells, or other materials), cell membrane receptors, complex carbohydrates and glycoproteins, enzymes, and hormone receptors.

As used herein “ligand” refers to a molecule, such as a random peptide or variable segment sequence, that is recognized by a particular receptor. As one of skill in the art will recognize, a molecule (or macromolecular complex) can be both a receptor and a ligand. In general, the binding partner having a smaller molecular weight is referred to as the ligand and the binding partner having a greater molecular weight is referred to as a receptor.

As used herein, “linker” or “spacer” refers to a molecule or group of molecules that connects two molecules, such as a DNA binding protein and a random peptide, so that the random peptide can bind to a receptor with minimal steric hindrance from the DNA binding protein.

4.16.5.3 Methodology

Nucleic acid shuffling is a method for in vitro or in vivo homologous recombination of pools of shorter or smaller polynucleotides to produce a polynucleotide or polynucleotides. Mixtures of related nucleic acid sequences or polynucleotides are subjected to sexual PCR to provide random polynucleotides, and reassembled to yield a library or mixed population of recombinant hybrid nucleic acid molecules or polynucleotides.

In contrast to cassette mutagenesis, only shuffling and error-prone PCR allow one to mutate a pool of sequences blindly (without sequence information other than primers).

4.16.5.3.1 Advantage of the Mutagenic Shuffling

The advantage of the mutagenic shuffling of this invention over error-prone PCR alone for repeated selection can best be explained with an example from antibody engineering.

4.16.5.3.2 Inverse Chain Reaction

This method differs from error-prone PCR, in that it is an inverse chain reaction. In error-prone PCR, the number of polymerase start sites and the number of molecules grow exponentially. However, the sequence of the polymerase start sites and the sequence of the molecules remain essentially the same. In contrast, in nucleic acid reassembly or shuffling of random polynucleotides the number of start sites and the number (but not size) of the random polynucleotides decrease over time. For polynucleotides derived from whole plasmids the theoretical endpoint is a single, large concatemeric molecule.

Since cross-overs occur at regions of homology, recombination will primarily occur between members of the same sequence family. This discourages combinations of CDRs that are grossly incompatible (e.g., directed against different epitopes of the same antigen). It is contemplated that multiple families of sequences can be shuffled in the same reaction. Further, shuffling generally conserves the relative order, such that, for example, CDR1 will not be found in the position of CDR2.

Rare shufflants will contain a large number of the best (e.g., highest affinity) CDRs and these rare shufflants may be selected based on their superior affinity.

CDRs from a pool of 100 different selected antibody sequences can be permutated in up to 100⁶ different ways. This large number of permutations cannot be represented in a single library of DNA sequences. Accordingly, it is contemplated that multiple cycles of DNA shuffling and selection may be required depending on the length of the sequence and the sequence diversity desired.
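The scale of this number can be made concrete with a short Python sketch. It reads the count as 100 raised to the sixth power, on the assumption of six CDR positions per antibody (three in the heavy chain variable region and three in the light chain variable region), each filled independently from the pool of 100. The function names and the library size of 10⁸ members are hypothetical values chosen only for illustration.

```python
# Illustrative sketch; names and the example library size are hypothetical.
def cdr_permutations(pool_size, cdr_positions=6):
    """Combinations when each CDR position is drawn independently from the pool."""
    return pool_size ** cdr_positions


def max_library_coverage(library_size, pool_size, cdr_positions=6):
    """Upper bound on the fraction of all combinations one library could hold."""
    return min(1.0, library_size / cdr_permutations(pool_size, cdr_positions))


total = cdr_permutations(100)
print(total)                               # 100**6, i.e., 10**12 combinations
print(max_library_coverage(10 ** 8, 100))  # fraction reachable by 10**8 members
```

Under these assumptions a single library of 10⁸ members could sample at most one combination in ten thousand, which is consistent with the contemplated need for multiple cycles of shuffling and selection.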

Error-prone PCR, in contrast, keeps all the selected CDRs in the same relative sequence, generating a much smaller mutant cloud.

4.16.5.3.3 The Template Polynucleotide

The template polynucleotide which may be used in the methods of this invention may be DNA or RNA. It may be of various lengths depending on the size of the gene or shorter or smaller polynucleotide to be recombined or reassembled. Preferably, the template polynucleotide is from 50 bp to 50 kb. It is contemplated that entire vectors containing the nucleic acid encoding the protein of interest can be used in the methods of this invention, and in fact have been successfully used.

The template polynucleotide may be obtained by amplification using the PCR reaction (U.S. Pat. Nos. 4,683,202 and 4,683,195) or other amplification or cloning methods. However, the removal of free primers from the PCR products before subjecting them to pooling of the PCR products and sexual PCR may provide more efficient results. Failure to adequately remove the primers from the original pool before sexual PCR can lead to a low frequency of crossover clones.

The template polynucleotide often should be double-stranded. A double-stranded nucleic acid molecule is recommended to ensure that regions of the resulting single-stranded polynucleotides are complementary to each other and thus can hybridize to form a double-stranded molecule.

It is contemplated that single-stranded or double-stranded nucleic acid polynucleotides having regions of identity to the template polynucleotide and regions of heterology to the template polynucleotide may be added to the template polynucleotide at this step. It is also contemplated that two different but related polynucleotide templates can be mixed at this step.

The double-stranded polynucleotide template and any added double- or single-stranded polynucleotides are subjected to sexual PCR which includes slowing or halting to provide a mixture of polynucleotides of from about 5 bp to 5 kb or more. Preferably the size of the random polynucleotides is from about 10 bp to 1000 bp, more preferably the size of the polynucleotides is from about 20 bp to 500 bp.

4.16.5.3.4 Use of Double-Stranded Nucleic Acid Having Multiple Nicks

Alternatively, it is also contemplated that double-stranded nucleic acid having multiple nicks may be used in the methods of this invention. A nick is a break in one strand of the double-stranded nucleic acid. The distance between such nicks is preferably 5 bp to 5 kb, more preferably between 10 bp to 1000 bp. This can provide areas of self-priming to produce shorter or smaller polynucleotides to be included with the polynucleotides resulting from random primers, for example.

The concentration of any one specific polynucleotide will not be greater than 1% by weight of the total polynucleotides, more preferably the concentration of any one specific nucleic acid sequence will not be greater than 0.1% by weight of the total nucleic acid.

The number of different specific polynucleotides in the mixture will be at least about 100, preferably at least about 500, and more preferably at least about 1000.
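The two composition limits stated above (no single specific polynucleotide exceeding 1% by weight, and at least about 100 different specific polynucleotides) can be expressed as a simple check. The following Python sketch is illustrative only: the function name, the dictionary layout mapping a sequence label to its weight, and the example mixtures are hypothetical, and the default limits are the least stringent values recited above.

```python
# Illustrative sketch; names, data layout, and example mixtures are hypothetical.
def check_mixture(weight_by_sequence, max_fraction=0.01, min_species=100):
    """Report whether a mixture satisfies the weight-fraction and diversity limits."""
    total = sum(weight_by_sequence.values())
    if total <= 0:
        raise ValueError("total weight must be positive")
    largest = max(weight_by_sequence.values()) / total
    species = len(weight_by_sequence)
    return {
        "species": species,
        "largest_fraction": largest,
        "within_limits": largest <= max_fraction and species >= min_species,
    }


even = {f"seq{i}": 1.0 for i in range(200)}   # 200 species at equal weight
skewed = dict(even, dominant=10.0)            # one over-represented species
print(check_mixture(even)["within_limits"])
print(check_mixture(skewed)["within_limits"])
```

Passing `max_fraction=0.001` or `min_species=1000` applies the more preferred limits.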

4.16.5.3.5 Increasing the Heterogeneity of the Mixture of Polynucleotides

At this step single-stranded or double-stranded polynucleotides, either synthetic or natural, may be added to the random double-stranded shorter or smaller polynucleotides in order to increase the heterogeneity of the mixture of polynucleotides.

It is also contemplated that populations of double-stranded randomly broken polynucleotides may be mixed or combined at this step with the polynucleotides from the sexual PCR process and optionally subjected to one or more additional sexual PCR cycles.

Where insertion of mutations into the template polynucleotide isdesired, single-stranded or double-stranded polynucleotides having aregion of identity to the template polynucleotide and a region ofheterology to the template polynucleotide may be added in a 20 foldexcess by weight as compared to the total nucleic acid, more preferablythe single-stranded polynucleotides may be added in a 10 fold excess byweight as compared to the total nucleic acid.

Where a mixture of different but related template polynucleotides isdesired, populations of polynucleotides from each of the templates maybe combined at a ratio of less than about 1:100, more preferably theratio is less than about 1: 40. For example, a backcross of thewild-type polynucleotide with a population of mutated polynucleotide maybe desired to eliminate neutral mutations (e.g., mutations yielding aninsubstantial alteration in the phenotypic property being selected for).In such an example, the ratio of randomly provided wild-typepolynucleotides which may be added to the randomly provided sexual PCRcycle hybrid polynucleotides is approximately 1:1 to about 100:1, andmore preferably from 1:1 to 40:1.

4.16.5.3.5.1 Denaturing and Re-annealing

The mixed population of random polynucleotides is denatured to form single-stranded polynucleotides and then re-annealed. Only those single-stranded polynucleotides having regions of homology with other single-stranded polynucleotides will re-anneal.

The random polynucleotides may be denatured by heating. One skilled in the art could determine the conditions necessary to completely denature the double-stranded nucleic acid. Preferably the temperature is from 80 °C to 100 °C, more preferably the temperature is from 90 °C to 96 °C. Other methods which may be used to denature the polynucleotides include pressure (36) and pH.

The polynucleotides may be re-annealed by cooling. Preferably the temperature is from 20 °C to 75 °C, more preferably the temperature is from 40 °C to 65 °C. If a high frequency of crossovers is needed based on an average of only 4 consecutive bases of homology, recombination can be forced by using a low annealing temperature, although the process becomes more difficult. The degree of renaturation which occurs will depend on the degree of homology between the population of single-stranded polynucleotides.

Renaturation can be accelerated by the addition of polyethylene glycol (“PEG”) or salt. The salt concentration is preferably from 0 mM to 200 mM, more preferably the salt concentration is from 10 mM to 100 mM. The salt may be KCl or NaCl. The concentration of PEG is preferably from 0% to 20%, more preferably from 5% to 10%.

4.16.5.3.5.2 Incubation

The annealed polynucleotides are next incubated in the presence of a nucleic acid polymerase and dNTPs (i.e., dATP, dCTP, dGTP and dTTP). The nucleic acid polymerase may be the Klenow fragment, the Taq polymerase or any other DNA polymerase known in the art.

The approach to be used for the assembly depends on the minimum degree of homology that should still yield crossovers. If the areas of identity are large, Taq polymerase can be used with an annealing temperature of between 45 °C and 65 °C. If the areas of identity are small, Klenow polymerase can be used with an annealing temperature of between 20 °C and 30 °C. One skilled in the art could vary the temperature of annealing to increase the number of cross-overs achieved.

The polymerase may be added to the random polynucleotides prior to annealing, simultaneously with annealing or after annealing.

The cycle of denaturation, renaturation and incubation in the presence of polymerase is referred to herein as shuffling or reassembly of the nucleic acid. This cycle is repeated for a desired number of times. Preferably the cycle is repeated from 2 to 50 times, more preferably the sequence is repeated from 10 to 40 times.

4.16.5.3.6 The Resulting Nucleic Acid

The resulting nucleic acid is a larger double-stranded polynucleotide of from about 50 bp to about 100 kb, preferably the larger polynucleotide is from 500 bp to 50 kb.

This larger polynucleotide may contain a number of copies of a polynucleotide having the same size as the template polynucleotide in tandem. This concatemeric polynucleotide is then denatured into single copies of the template polynucleotide. The result will be a population of polynucleotides of approximately the same size as the template polynucleotide. The population will be a mixed population where single or double-stranded polynucleotides having an area of identity and an area of heterology have been added to the template polynucleotide prior to shuffling.

These polynucleotides are then cloned into the appropriate vector and the ligation mixture used to transform bacteria.

It is contemplated that the single polynucleotides may be obtained from the larger concatemeric polynucleotide by amplification of the single polynucleotide prior to cloning by a variety of methods including PCR (U.S. Pat. Nos. 4,683,195 and 4,683,202), rather than by digestion of the concatemer.

4.16.5.3.7 Vectors Used for Cloning

The vector used for cloning is not critical provided that it will accept a polynucleotide of the desired size. If expression of the particular polynucleotide is desired, the cloning vehicle should further comprise transcription and translation signals next to the site of insertion of the polynucleotide to allow expression of the polynucleotide in the host cell. Preferred vectors include the pUC series and the pBR series of plasmids.

4.16.5.3.8 The Resulting Bacterial Population

The resulting bacterial population will include a number of recombinant polynucleotides having random mutations. This mixed population may be tested to identify the desired recombinant polynucleotides. The method of selection will depend on the polynucleotide desired.

For example, if a polynucleotide which encodes a protein with increased binding efficiency to a ligand is desired, the proteins expressed by each of the portions of the polynucleotides in the population or library may be tested for their ability to bind to the ligand by methods known in the art (e.g., panning, affinity chromatography). If a polynucleotide which encodes a protein with increased drug resistance is desired, the proteins expressed by each of the polynucleotides in the population or library may be tested for their ability to confer drug resistance to the host organism. One skilled in the art, given knowledge of the desired protein, could readily test the population to identify polynucleotides which confer the desired properties onto the protein.

It is contemplated that one skilled in the art could use a phage display system in which fragments of the protein are expressed as fusion proteins on the phage surface (Pharmacia, Milwaukee, Wis.). The recombinant DNA molecules are cloned into the phage DNA at a site which results in the transcription of a fusion protein, a portion of which is encoded by the recombinant DNA molecule. The phage containing the recombinant nucleic acid molecule undergoes replication and transcription in the cell. The leader sequence of the fusion protein directs the transport of the fusion protein to the tip of the phage particle. Thus the fusion protein which is partially encoded by the recombinant DNA molecule is displayed on the phage particle for detection and selection by the methods described above.

4.16.5.3.9 Cycles of Nucleic Acid Shuffling

It is further contemplated that a number of cycles of nucleic acid shuffling may be conducted with polynucleotides from a sub-population of the first population, which sub-population contains DNA encoding the desired recombinant protein. In this manner, proteins with even higher binding affinities or enzymatic activity could be achieved.

It is also contemplated that a number of cycles of nucleic acid shuffling may be conducted with a mixture of wild-type polynucleotides and a sub-population of nucleic acid from the first or subsequent rounds of nucleic acid shuffling in order to remove any silent mutations from the sub-population.

4.16.5.3.10 The Starting Nucleic Acid

Any source of nucleic acid, in purified form, can be utilized as the starting nucleic acid. Thus the process may employ DNA or RNA including messenger RNA, which DNA or RNA may be single or double stranded. In addition, a DNA-RNA hybrid which contains one strand of each may be utilized. The nucleic acid sequence may be of various lengths depending on the size of the nucleic acid sequence to be mutated. Preferably the specific nucleic acid sequence is from 50 to 50000 base pairs. It is contemplated that entire vectors containing the nucleic acid encoding the protein of interest may be used in the methods of this invention.

The nucleic acid may be obtained from any source, for example, from plasmids such as pBR322, from cloned DNA or RNA or from natural DNA or RNA from any source including bacteria, yeast, viruses and higher organisms such as plants or animals. DNA or RNA may be extracted from blood or tissue material. The template polynucleotide may be obtained by amplification using the polymerase chain reaction (PCR) (U.S. Pat. Nos. 4,683,202 and 4,683,195). Alternatively, the polynucleotide may be present in a vector present in a cell and sufficient nucleic acid may be obtained by culturing the cell and extracting the nucleic acid from the cell by methods known in the art.

Any specific nucleic acid sequence can be used to produce the population of hybrids by the present process. It is only necessary that a small population of hybrid sequences of the specific nucleic acid sequence exist or be created prior to the present process.

4.16.5.3.11 Creation of the Initial Population of Sequences

The initial small population of the specific nucleic acid sequences having mutations may be created by a number of different methods. Mutations may be created by error-prone PCR. Error-prone PCR uses low-fidelity polymerization conditions to introduce a low level of point mutations randomly over a long sequence. Alternatively, mutations can be introduced into the template polynucleotide by oligonucleotide-directed mutagenesis. In oligonucleotide-directed mutagenesis, a short sequence of the polynucleotide is removed from the polynucleotide using restriction enzyme digestion and is replaced with a synthetic polynucleotide in which various bases have been altered from the original sequence. The polynucleotide sequence can also be altered by chemical mutagenesis. Chemical mutagens include, for example, sodium bisulfite, nitrous acid, hydroxylamine, hydrazine or formic acid. Other agents which are analogues of nucleotide precursors include nitrosoguanidine, 5-bromouracil, 2-aminopurine, or acridine. Generally, these agents are added to the PCR reaction in place of the nucleotide precursor, thereby mutating the sequence. Intercalating agents such as proflavine, acriflavine, quinacrine and the like can also be used. Random mutagenesis of the polynucleotide sequence can also be achieved by irradiation with X-rays or ultraviolet light. Generally, plasmid polynucleotides so mutagenized are introduced into E. coli and propagated as a pool or library of hybrid plasmids.

Alternatively, the small mixed population of specific nucleic acids may be found in nature in that they may consist of different alleles of the same gene or the same gene from different related species (i.e., cognate genes). Alternatively, they may be related DNA sequences found within one species, for example, the immunoglobulin genes.

Once the mixed population of the specific nucleic acid sequences is generated, the polynucleotides can be used directly or inserted into an appropriate cloning vector, using techniques well-known in the art.

4.16.5.3.11.1 The Choice of Vector

The choice of vector depends on the size of the polynucleotide sequence and the host cell to be employed in the methods of this invention. The templates of this invention may be plasmids, phages, cosmids, phagemids, viruses (e.g., retroviruses, parainfluenzavirus, herpesviruses, reoviruses, paramyxoviruses, and the like), or selected portions thereof (e.g., coat protein, spike glycoprotein, capsid protein). For example, cosmids and phagemids are preferred where the specific nucleic acid sequence to be mutated is larger because these vectors are able to stably propagate large polynucleotides.

4.16.5.3.11.2 Clonal Amplification

If the mixed population of the specific nucleic acid sequence is cloned into a vector, it can be clonally amplified by inserting each vector into a host cell and allowing the host cell to amplify the vector. This is referred to as clonal amplification because while the absolute number of nucleic acid sequences increases, the number of hybrids does not increase. Utility can be readily determined by screening expressed polypeptides.
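The distinction drawn above, that clonal amplification raises the number of molecules but not the number of distinct hybrids, can be shown with a toy count. This is a minimal sketch; the sequences, the copy number, and the function name `clonally_amplify` are hypothetical and stand in for a real library.

```python
from collections import Counter


def clonally_amplify(library, copies_per_cell):
    """Replicate every vector insert in the library the same number of times."""
    return [seq for seq in library for _ in range(copies_per_cell)]


# Three hypothetical hybrid sequences making up the starting library.
library = ["ACGT", "ACGA", "TCGT"]
amplified = clonally_amplify(library, 1000)

total_molecules = len(amplified)             # grows with amplification
distinct_hybrids = len(Counter(amplified))   # stays at the library size
print(total_molecules, distinct_hybrids)
```

New hybrids arise only from mutagenesis or recombination, not from replication of existing clones.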

4.16.5.3.12 Incorporation of Any Sequence Mixture at Any Specific Position

The DNA shuffling method of this invention can be performed blindly on a pool of unknown sequences. By adding to the reassembly mixture oligonucleotides (with ends that are homologous to the sequences being reassembled) any sequence mixture can be incorporated at any specific position into another sequence mixture. Thus, it is contemplated that mixtures of synthetic oligonucleotides, PCR polynucleotides or even whole genes can be mixed into another sequence library at defined positions. The insertion of one sequence (mixture) is independent from the insertion of a sequence in another part of the template. Thus, the degree of recombination, the homology required, and the diversity of the library can be independently and simultaneously varied along the length of the reassembled DNA.

This approach of mixing two genes may be useful for the humanization of antibodies. The insertion of sequences into genes may be useful for any therapeutically used protein, for example, interleukin 1, antibodies, tPA and growth hormone. The approach may also be useful in any nucleic acid, for example, promoters, introns, 3′ untranslated regions or 5′ untranslated regions of genes, to increase expression or alter specificity of expression of proteins. The approach may also be used to mutate ribozymes or aptamers.

4.16.5.3.13 Creation of Scaffold-Like Proteins

Shuffling requires the presence of homologous regions separating regions of diversity. Scaffold-like protein structures may be particularly suitable for shuffling. The conserved scaffold determines the overall folding by self-association, while displaying relatively unrestricted loops that mediate the specific binding. Examples of such scaffolds are the immunoglobulin beta-barrel and the four-helix bundle, which are well-known in the art. This shuffling can be used to create scaffold-like proteins with various combinations of mutated sequences for binding.

4.16.5.4 In Vitro Shuffling

The equivalents of some standard genetic matings may also be performed by shuffling in vitro. For example, a “molecular backcross” can be performed by repeatedly mixing the hybrid's nucleic acid with the wild-type nucleic acid while selecting for the mutations of interest. As in traditional breeding, this approach can be used to combine phenotypes from different sources into a background of choice. It is useful, for example, for the removal of neutral mutations that affect unselected characteristics (e.g., immunogenicity). Thus it can be useful to determine which mutations in a protein are involved in the enhanced biological activity and which are not, an advantage which cannot be achieved by error-prone mutagenesis or cassette mutagenesis methods.

Large, functional genes can be assembled correctly from a mixture of small random polynucleotides. This reaction may be of use for the reassembly of genes from the highly fragmented DNA of fossils. In addition, random nucleic acid fragments from fossils may be combined with polynucleotides from similar genes from related species.

4.16.5.4.1 In Vitro Amplification of a Genome

It is also contemplated that the method of this invention can be used for the in vitro amplification of a whole genome from a single cell as is needed for a variety of research and diagnostic applications. DNA amplification by PCR is in practice limited to a length of about 40 kb. Amplification of a whole genome such as that of E. coli (5,000 kb) by PCR would require about 250 primers yielding 125 forty-kb polynucleotides. This approach is not practical due to the unavailability of sufficient sequence data. On the other hand, random production of polynucleotides of the genome with sexual PCR cycles, followed by gel purification of small polynucleotides, will provide a multitude of possible primers. Use of this mix of random small polynucleotides as primers in a PCR reaction alone or with the whole genome as the template should result in an inverse chain reaction with the theoretical endpoint of a single concatemer containing many copies of the genome.
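The primer count quoted above follows from simple arithmetic. The sketch below reproduces it under two stated assumptions that are not in the text: the genome is tiled with non-overlapping amplicons of the maximum practical length, and each amplicon needs one pair of primers. The function name `primers_for_genome` is illustrative.

```python
import math


def primers_for_genome(genome_kb, max_amplicon_kb=40):
    """Return (amplicons, primers) needed to tile a genome by conventional PCR.

    Assumes non-overlapping amplicons and two primers per amplicon.
    """
    amplicons = math.ceil(genome_kb / max_amplicon_kb)
    return amplicons, 2 * amplicons


# E. coli figure used in the text: 5,000 kb genome, 40 kb amplicon limit.
amplicons, primers = primers_for_genome(5000)
print(amplicons, primers)  # 125 amplicons, 250 primers
```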

100-fold amplification in the copy number and an average polynucleotide size of greater than 50 kb may be obtained when only random polynucleotides are used. It is thought that the larger concatemer is generated by overlap of many smaller polynucleotides. The quality of specific PCR products obtained using synthetic primers will be indistinguishable from the product obtained from unamplified DNA. It is expected that this approach will be useful for the mapping of genomes.

The polynucleotide to be shuffled can be produced as random or non-random polynucleotides, at the discretion of the practitioner.

4.16.5.5 In Vivo Shuffling

In an embodiment of in vivo shuffling, the mixed population of the specific nucleic acid sequence is introduced into bacterial or eukaryotic cells under conditions such that at least two different nucleic acid sequences are present in each host cell. The polynucleotides can be introduced into the host cells by a variety of different methods. The host cells can be transformed with the smaller polynucleotides using methods known in the art, for example treatment with calcium chloride. If the polynucleotides are inserted into a phage genome, the host cell can be transfected with the recombinant phage genome having the specific nucleic acid sequences. Alternatively, the nucleic acid sequences can be introduced into the host cell using electroporation, transfection, lipofection, biolistics, conjugation, and the like.

In general, in this embodiment, the specific nucleic acid sequences will be present in vectors which are capable of stably replicating the sequence in the host cell. In addition, it is contemplated that the vectors will encode a marker gene such that host cells having the vector can be selected. This ensures that the mutated specific nucleic acid sequence can be recovered after introduction into the host cell. However, it is contemplated that the entire mixed population of the specific nucleic acid sequences need not be present on a vector sequence. Rather, only a sufficient number of sequences need be cloned into vectors to ensure that after introduction of the polynucleotides into the host cells each host cell contains one vector having at least one specific nucleic acid sequence present therein. It is also contemplated that rather than having a subset of the population of the specific nucleic acid sequences cloned into vectors, this subset may be already stably integrated into the host cell.

4.16.5.5.1 Homologous Recombination

It has been found that when two polynucleotides which have regions of identity are inserted into the host cells, homologous recombination occurs between the two polynucleotides. Such recombination between the two mutated specific nucleic acid sequences will result in the production of double or triple hybrids in some situations.
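How recombination between two singly mutated sequences can yield a double hybrid is illustrated by the toy model below. It is a sketch only: it assumes aligned, equal-length sequences and a single crossover at a chosen position, and the sequences and the function name `single_crossover` are hypothetical.

```python
def single_crossover(parent_a, parent_b, point):
    """Join the 5' part of parent_a to the 3' part of parent_b at `point`."""
    if len(parent_a) != len(parent_b):
        raise ValueError("toy model assumes aligned, equal-length sequences")
    return parent_a[:point] + parent_b[point:]


wild_type = "AAAAAAAAAA"
mutant_1 = "AAGAAAAAAA"  # point mutation at position 2 (0-based)
mutant_2 = "AAAAAAATAA"  # point mutation at position 7 (0-based)

# A crossover between the two mutated sites combines both mutations...
double_hybrid = single_crossover(mutant_1, mutant_2, 5)
# ...while the reciprocal product carries neither and matches the wild type.
reciprocal = single_crossover(mutant_2, mutant_1, 5)
print(double_hybrid, reciprocal)
```

The shared flanking sequence on either side of the crossover point plays the role of the regions of identity mentioned above.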

4.16.5.5.2 Increase in the Frequency of Recombination

It has also been found that the frequency of recombination is increased if some of the mutated specific nucleic acid sequences are present on linear nucleic acid molecules. Therefore, in a preferred embodiment, some of the specific nucleic acid sequences are present on linear polynucleotides.

4.16.5.5.3 Identification of Host Cell Transformants Containing Desired Sequences

After transformation, the host cell transformants are placed under selection to identify those host cell transformants which contain mutated specific nucleic acid sequences having the qualities desired. For example, if increased resistance to a particular drug is desired, then the transformed host cells may be subjected to increased concentrations of the particular drug and those transformants producing mutated proteins able to confer increased drug resistance will be selected. If the enhanced ability of a particular protein to bind to a receptor is desired, then expression of the protein can be induced from the transformants and the resulting protein assayed in a ligand binding assay by methods known in the art to identify that subset of the mutated population which shows enhanced binding to the ligand. Alternatively, the protein can be expressed in another system to ensure proper processing.

Once a subset of the first recombined specific nucleic acid sequences (daughter sequences) having the desired characteristics is identified, the sequences in that subset are subjected to a second round of recombination.

4.16.5.5.4 The Second Cycle of Recombination

In the second cycle of recombination, the recombined specific nucleic acid sequences may be mixed with the original mutated specific nucleic acid sequences (parent sequences) and the cycle repeated as described above. In this way a set of second recombined specific nucleic acid sequences can be identified which have enhanced characteristics or encode proteins having enhanced properties. This cycle can be repeated a number of times as desired.

It is also contemplated that in the second or subsequent recombination cycle, a backcross can be performed. A molecular backcross can be performed by mixing the desired specific nucleic acid sequences with a large number of the wild-type sequence, such that at least one wild-type nucleic acid sequence and a mutated nucleic acid sequence are present in the same host cell after transformation. Recombination with the wild-type specific nucleic acid sequence will eliminate those neutral mutations that may affect unselected characteristics such as immunogenicity but not the selected characteristics.

4.16.5.5.5 Generation of a Subset of the Specific Nucleic Acid Sequences

In another embodiment of this invention, it is contemplated that during the first round a subset of the specific nucleic acid sequences can be generated as smaller polynucleotides by slowing or halting their PCR amplification prior to introduction into the host cell. The size of the polynucleotides must be large enough to contain some regions of identity with the other sequences so as to homologously recombine with the other sequences. The size of the polynucleotides will range from 0.03 kb to 100 kb, more preferably from 0.2 kb to 10 kb. It is also contemplated that in subsequent rounds, all of the specific nucleic acid sequences other than the sequences selected from the previous round may be utilized to generate PCR polynucleotides prior to introduction into the host cells.

The shorter polynucleotide sequences can be single-stranded or double-stranded. If the sequences were originally single-stranded and have become double-stranded, they can be denatured with heat, chemicals or enzymes prior to insertion into the host cell. The reaction conditions suitable for separating the strands of nucleic acid are well known in the art.

The steps of this process can be repeated indefinitely, being limited only by the number of possible hybrids which can be achieved. After a certain number of cycles, all possible hybrids will have been achieved and further cycles are redundant.
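The upper bound on the number of possible hybrids can be made concrete with a small enumeration. The sketch below assumes a simplified model that is not stated in the text: each of n mutated sites is either wild-type or mutant and the sites combine freely, giving at most 2^n distinct sequences. The function name `all_hybrids` and the example mutations are hypothetical.

```python
from itertools import product


def all_hybrids(wild_type, mutations):
    """Enumerate every presence/absence combination of the given point mutations.

    mutations: dict mapping a 0-based position to its mutant base.
    The returned set includes the wild-type sequence (no mutation present).
    """
    positions = sorted(mutations)
    hybrids = set()
    for choice in product([False, True], repeat=len(positions)):
        seq = list(wild_type)
        for pos, use_mutant in zip(positions, choice):
            if use_mutant:
                seq[pos] = mutations[pos]
        hybrids.add("".join(seq))
    return hybrids


# Three hypothetical point mutations give 2**3 = 8 possible hybrids.
hybrids = all_hybrids("AAAAAAAAAA", {2: "G", 7: "T", 9: "C"})
print(len(hybrids))
```

Once every member of this set has appeared in the population, further recombination cycles can only regenerate sequences already present.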

In an embodiment, the same mutated template nucleic acid is repeatedly recombined and the resulting recombinants selected for the desired characteristic.

4.16.5.5.6 Cloning into a Vector Capable of Replicating in a Bacterium

Therefore, the initial pool or population of mutated template nucleic acid is cloned into a vector capable of replicating in a bacterium such as E. coli. The particular vector is not essential, so long as it is capable of autonomous replication in E. coli. In a preferred embodiment, the vector is designed to allow the expression and production of any protein encoded by the mutated specific nucleic acid linked to the vector. It is also preferred that the vector contain a gene encoding a selectable marker.

The population of vectors containing the pool of mutated nucleic acid sequences is introduced into the E. coli host cells. The vector nucleic acid sequences may be introduced by transformation, transfection or infection in the case of phage. The concentration of vectors used to transform the bacteria is such that a number of vectors is introduced into each cell. Once present in the cell, the efficiency of homologous recombination is such that homologous recombination occurs between the various vectors. This results in the generation of hybrids (daughters) having a combination of mutations which differ from the original parent mutated sequences.

The host cells are then clonally replicated and selected for the marker gene present on the vector. Only those cells having a plasmid will grow under the selection.

4.16.5.5.7 Testing for the Presence of Favorable Mutations

The host cells which contain a vector are then tested for the presence of favorable mutations. Such testing may consist of placing the cells under selective pressure, for example, if the gene to be selected is an improved drug resistance gene. If the vector allows expression of the protein encoded by the mutated nucleic acid sequence, then such selection may include allowing expression of the protein so encoded, isolation of the protein and testing of the protein to determine whether, for example, it binds with increased efficiency to the ligand of interest.

4.16.5.5.8 Isolation of the Desired Nucleic Acid Sequence

Once a particular daughter mutated nucleic acid sequence has been identified which confers the desired characteristics, the nucleic acid is isolated either already linked to the vector or separated from the vector. This nucleic acid is then mixed with the first or parent population of nucleic acids and the cycle is repeated.

It has been shown that by this method nucleic acid sequences having enhanced desired properties can be selected.

4.16.5.5.9 Addition of Parental Mutated Sequences to the Cells Containing the First Generation of Hybrids

In an alternate embodiment, the first generation of hybrids is retained in the cells and the parental mutated sequences are added again to the cells. Accordingly, the first cycle of Embodiment I is conducted as described above. However, after the daughter nucleic acid sequences are identified, the host cells containing these sequences are retained.

The parent mutated specific nucleic acid population, either as polynucleotides or cloned into the same vector, is introduced into the host cells already containing the daughter nucleic acids. Recombination is allowed to occur in the cells and the next generation of recombinants, or granddaughters, is selected by the methods described above.

This cycle can be repeated a number of times until the nucleic acid or peptide having the desired characteristics is obtained. It is contemplated that in subsequent cycles, the population of mutated sequences which are added to the preferred hybrids may come from the parental hybrids or any subsequent generation.

4.16.5.5.10 “Molecular” Backcross to Eliminate Any Neutral Mutations

In an alternative embodiment, the invention provides a method of conducting a “molecular” backcross of the obtained recombinant specific nucleic acid in order to eliminate any neutral mutations. Neutral mutations are those mutations which do not confer onto the nucleic acid or peptide the desired properties. Such mutations may however confer on the nucleic acid or peptide undesirable characteristics. Accordingly, it is desirable to eliminate such neutral mutations. The method of this invention provides a means of doing so.

In this embodiment, after the hybrid nucleic acid having the desired characteristics is obtained by the methods of the embodiments, the nucleic acid, the vector having the nucleic acid or the host cell containing the vector and nucleic acid is isolated.

The nucleic acid or vector is then introduced into the host cell with a large excess of the wild-type nucleic acid. The nucleic acid of the hybrid and the nucleic acid of the wild-type sequence are allowed to recombine. The resulting recombinants are placed under the same selection as the hybrid nucleic acid. Only those recombinants which retained the desired characteristics will be selected. Any silent mutations which do not provide the desired characteristics will be lost through recombination with the wild-type DNA. This cycle can be repeated a number of times until all of the silent mutations are eliminated.

Thus the methods of this invention can be used in a molecular backcross to eliminate unnecessary or silent mutations.
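The logic of the molecular backcross can be illustrated with a toy simulation. This is a sketch under assumptions that are not taken from the text: mutations are tracked as a set of positions, each mutation is independently replaced by the wild-type base with probability 0.5 in every cycle (a free-recombination model), and selection keeps only recombinants that retain every beneficial mutation. The function name `backcross` and the example positions are hypothetical.

```python
import random


def backcross(hybrid, beneficial, cycles, rng):
    """Toy molecular backcross against the wild-type sequence.

    hybrid: set of mutated positions carried by the selected hybrid.
    beneficial: subset of positions that confer the selected property.
    Each cycle draws recombinants until one passes selection, that is,
    until one still carries all beneficial mutations.
    """
    current = set(hybrid)
    for _ in range(cycles):
        while True:
            # Recombination: each mutation survives with probability 0.5.
            child = {m for m in current if rng.random() < 0.5}
            if beneficial <= child:  # selection step
                break
        current = child
    return current


rng = random.Random(0)  # fixed seed so the run is repeatable
start = {1, 4, 9, 15, 22}   # five mutations in the hybrid
selected = {4, 15}          # two of them confer the desired property
result = backcross(start, selected, cycles=10, rng=rng)
print(sorted(result))
```

Under this model the selected mutations are always retained, while each unselected mutation has a 0.5 chance of being lost per cycle, so repeated cycles tend to strip the unselected mutations from the hybrid.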

4.16.5.6 Utility

The in vivo recombination method of this invention can be performed blindly on a pool of unknown hybrids or alleles of a specific polynucleotide or sequence. That is, it is not necessary to know the actual DNA or RNA sequence of the specific polynucleotide.

The approach of using recombination within a mixed population of genes can be useful for the generation of any useful proteins, for example, interleukin 1, antibodies, tPA and growth hormone. This approach may be used to generate proteins having altered specificity or activity. The approach may also be useful for the generation of hybrid nucleic acid sequences, for example, promoter regions, introns, exons, enhancer sequences, 3′ untranslated regions or 5′ untranslated regions of genes. Thus this approach may be used to generate genes having increased rates of expression. This approach may also be useful in the study of repetitive DNA sequences. Finally, this approach may be useful to mutate ribozymes or aptamers.

Scaffold-like regions separating regions of diversity in proteins may be particularly suitable for the methods of this invention. The conserved scaffold determines the overall folding by self-association, while displaying relatively unrestricted loops that mediate the specific binding. Examples of such scaffolds are the immunoglobulin beta barrel and the four-helix bundle. The methods of this invention can be used to create scaffold-like proteins with various combinations of mutated sequences for binding.

The equivalents of some standard genetic matings may also be performed by the methods of this invention. For example, a “molecular” backcross can be performed by repeated mixing of the hybrid's nucleic acid with the wild-type nucleic acid while selecting for the mutations of interest. As in traditional breeding, this approach can be used to combine phenotypes from different sources into a background of choice. It is useful, for example, for the removal of neutral mutations that affect unselected characteristics (e.g., immunogenicity). Thus it can be useful to determine which mutations in a protein are involved in the enhanced biological activity and which are not.

4.16.5.7 Peptide Display Methods

The present method can be used to shuffle, by in vitro and/or in vivorecombination by any of the disclosed methods, and in any combination,polynucleotide sequences selected by peptide display methods, wherein anassociated polynucleotide encodes a displayed peptide which is screenedfor a phenotype (e.g., for affinity for a predetermined receptor(ligand).

An increasingly important aspect of bio-pharmaceutical drug developmentand molecular biology is the identification of peptide structures,including the primary amino acid sequences, of peptides orpeptidomimetics that interact with biological macromolecules. one methodof identifying peptides that possess a desired structure or functionalproperty, such as binding to a predetermined biological macromolecule(e.g., a receptor), involves the screening of a large library orpeptides for individual library members which possess the desiredstructure or functional property conferred by the amino acid sequence ofthe peptide.

In addition to direct chemical synthesis methods for generating peptide libraries, several recombinant DNA methods also have been reported. One type involves the display of a peptide sequence, antibody, or other protein on the surface of a bacteriophage particle or cell. Generally, in these methods each bacteriophage particle or cell serves as an individual library member displaying a single species of displayed peptide in addition to the natural bacteriophage or cell protein sequences. Each bacteriophage or cell contains the nucleotide sequence information encoding the particular displayed peptide sequence; thus, the displayed peptide sequence can be ascertained by nucleotide sequence determination of an isolated library member.

A well-known peptide display method involves the presentation of a peptide sequence on the surface of a filamentous bacteriophage, typically as a fusion with a bacteriophage coat protein. The bacteriophage library can be incubated with an immobilized, predetermined macromolecule or small molecule (e.g., a receptor) so that bacteriophage particles which present a peptide sequence that binds to the immobilized macromolecule can be differentially partitioned from those that do not present peptide sequences that bind to the predetermined macromolecule. The bacteriophage particles (i.e., library members) which are bound to the immobilized macromolecule are then recovered and replicated to amplify the selected bacteriophage sub-population for a subsequent round of affinity enrichment and phage replication. After several rounds of affinity enrichment and phage replication, the bacteriophage library members that are thus selected are isolated and the nucleotide sequence encoding the displayed peptide sequence is determined, thereby identifying the sequence(s) of peptides that bind to the predetermined macromolecule (e.g., receptor). Such methods are further described in PCT patent publication Nos. 91/17271, 91/18980, 91/19818, and 93/08278.
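The effect of repeated rounds of affinity enrichment and phage replication can be illustrated with a toy calculation. The following is a minimal sketch only: the per-round recovery probabilities for binders and for background (non-binding) phage, and the starting binder fraction, are assumed values chosen for illustration and are not taken from this disclosure.

```python
def enrich(binder_fraction, p_binder, p_background, rounds):
    """Return the binder fraction before panning and after each round.

    p_binder and p_background are assumed per-round recovery
    probabilities for binding and non-binding phage, respectively.
    Amplification is assumed to be uniform, so it does not change
    the fractions.
    """
    history = [binder_fraction]
    fraction = binder_fraction
    for _ in range(rounds):
        kept_binders = fraction * p_binder
        kept_others = (1.0 - fraction) * p_background
        fraction = kept_binders / (kept_binders + kept_others)
        history.append(fraction)
    return history


# Hypothetical library: one binder per million members.
for round_number, fraction in enumerate(enrich(1e-6, 0.5, 1e-3, 4)):
    print(f"round {round_number}: binder fraction = {fraction:.6f}")
```

Under these assumed values each round multiplies the binder-to-background odds by p_binder / p_background, which is why a few rounds can suffice to bring a rare binder to dominance.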

The latter PCT publication describes a recombinant DNA method for the display of peptide ligands that involves the production of a library of fusion proteins with each fusion protein composed of a first polypeptide portion, typically comprising a variable sequence, that is available for potential binding to a predetermined macromolecule, and a second polypeptide portion that binds to DNA, such as the DNA vector encoding the individual fusion protein. When transformed host cells are cultured under conditions that allow for expression of the fusion protein, the fusion protein binds to the DNA vector encoding it. Upon lysis of the host cell, the fusion protein/vector DNA complexes can be screened against a predetermined macromolecule in much the same way as bacteriophage particles are screened in the phage-based display system, with the replication and sequencing of the DNA vectors in the selected fusion protein/vector DNA complexes serving as the basis for identification of the selected library peptide sequence(s).

4.16.5.7.1 Hybrid Methods for Generating Libraries of Peptides and Like Polymers

Other systems for generating libraries of peptides and like polymers have aspects of both the recombinant and in vitro chemical synthesis methods. In these hybrid methods, cell-free enzymatic machinery is employed to accomplish the in vitro synthesis of the library members (i.e., peptides or polynucleotides). In one type of method, RNA molecules with the ability to bind a predetermined protein or a predetermined dye molecule were selected by alternate rounds of selection and PCR amplification (Tuerk and Gold (1990) Science 249: 505; Ellington and Szostak (1990) Nature 346: 818). A similar technique was used to identify DNA sequences which bind a predetermined human transcription factor (Thiesen and Bach (1990) Nucleic Acids Res. 18: 3203; Beaudry and Joyce (1992) Science 257: 635; PCT patent publication Nos. 92/05258 and 92/14843). In a similar fashion, the technique of in vitro translation has been used to synthesize proteins of interest and has been proposed as a method for generating large libraries of peptides. These methods, which rely upon in vitro translation and generally comprise stabilized polysome complexes, are described further in PCT patent publication Nos. 88/08453, 90/05785, 90/07003, 91/02076, 91/05058, and 92/02536. Applicants have described methods in which library members comprise a fusion protein having a first polypeptide portion with DNA binding activity and a second polypeptide portion having the library member unique peptide sequence; such methods are suitable for use in cell-free in vitro selection formats, among others.

4.16.5.7.2 The Displayed Peptide Sequences

The displayed peptide sequences can be of varying lengths, typically from 3-5000 amino acids long or longer, frequently from 5-100 amino acids long, and often from about 8-15 amino acids long. A library can comprise library members having varying lengths of displayed peptide sequence, or may comprise library members having a fixed length of displayed peptide sequence. Portions or all of the displayed peptide sequence(s) can be random, pseudorandom, defined set kernel, fixed, or the like. The present display methods include methods for in vitro and in vivo display of single-chain antibodies, such as nascent scFv on polysomes or scFv displayed on phage, which enable large-scale screening of scFv libraries having broad diversity of variable region sequences and binding specificities.

4.16.5.7.3 Sequence Framework Peptide Libraries

The present invention also provides random, pseudorandom, and defined sequence framework peptide libraries and methods for generating and screening those libraries to identify useful compounds (e.g., peptides, including single-chain antibodies) that bind to receptor molecules or epitopes of interest or gene products that modify peptides or RNA in a desired fashion. The random, pseudorandom, and defined sequence framework peptides are produced from libraries of peptide library members that comprise displayed peptides or displayed single-chain antibodies attached to a polynucleotide template from which the displayed peptide was synthesized. The mode of attachment may vary according to the specific embodiment of the invention selected, and can include encapsulation in a phage particle or incorporation in a cell.

4.16.5.7.4 Selecting for the Desired Peptide Using Affinity Enrichment

A method of affinity enrichment allows a very large library of peptides and single-chain antibodies to be screened and the polynucleotide sequence encoding the desired peptide(s) or single-chain antibodies to be selected. The polynucleotide can then be isolated and shuffled to recombine combinatorially the amino acid sequence of the selected peptide(s) (or predetermined portions thereof) or single-chain antibodies (or just VH, VL, or CDR portions thereof). Using these methods, one can identify a peptide or single-chain antibody as having a desired binding affinity for a molecule and can exploit the process of shuffling to converge rapidly to a desired high-affinity peptide or scFv. The peptide or antibody can then be synthesized in bulk by conventional means for any suitable use (e.g., as a therapeutic or diagnostic agent).

A significant advantage of the present invention is that no prior information regarding an expected ligand structure is required to isolate peptide ligands or antibodies of interest. The peptide identified can have biological activity, which is meant to include at least specific binding affinity for a selected receptor molecule and, in some instances, will further include the ability to block the binding of other compounds, to stimulate or inhibit metabolic pathways, to act as a signal or messenger, to stimulate or inhibit cellular activity, and the like.

4.16.5.7.5 Shuffling Sequences Selected by Affinity Screening

The present invention also provides a method for shuffling a pool of polynucleotide sequences selected by affinity screening a library of polysomes displaying nascent peptides (including single-chain antibodies) for library members which bind to a predetermined receptor (e.g., a mammalian proteinaceous receptor such as, for example, a peptidergic hormone receptor, a cell surface receptor, an intracellular protein which binds to other protein(s) to form intracellular protein complexes such as hetero-dimers and the like) or epitope (e.g., an immobilized protein, glycoprotein, oligosaccharide, and the like).

Polynucleotide sequences selected in a first selection round (typically by affinity selection for binding to a receptor (e.g., a ligand)) by any of these methods are pooled and the pool(s) is/are shuffled by in vitro and/or in vivo recombination to produce a shuffled pool comprising a population of recombined selected polynucleotide sequences. The recombined selected polynucleotide sequences are subjected to at least one subsequent selection round. The polynucleotide sequences selected in the subsequent selection round(s) can be used directly, sequenced, and/or subjected to one or more additional rounds of shuffling and subsequent selection. Selected sequences can also be back-crossed with polynucleotide sequences encoding neutral sequences (i.e., having insubstantial functional effect on binding), such as for example by back-crossing with a wild-type or naturally-occurring sequence substantially identical to a selected sequence to produce native-like functional peptides, which may be less immunogenic. Generally, during back-crossing subsequent selection is applied to retain the property of binding to the predetermined receptor (ligand).

Prior to or concomitant with the shuffling of selected sequences, the sequences can be mutagenized. In one embodiment, selected library members are cloned in a prokaryotic vector (e.g., plasmid, phagemid, or bacteriophage) wherein a collection of individual colonies (or plaques) representing discrete library members are produced. Individual selected library members can then be manipulated (e.g., by site-directed mutagenesis, cassette mutagenesis, chemical mutagenesis, PCR mutagenesis, and the like) to generate a collection of library members representing a kernel of sequence diversity based on the sequence of the selected library member. The sequence of an individual selected library member or pool can be manipulated to incorporate random mutation, pseudorandom mutation, defined kernel mutation (i.e., comprising variant and invariant residue positions and/or comprising variant residue positions which can comprise a residue selected from a defined subset of amino acid residues), codon-based mutation, and the like, either segmentally or over the entire length of the individual selected library member sequence. The mutagenized selected library members are then shuffled by in vitro and/or in vivo recombinatorial shuffling as disclosed herein.

4.16.5.7.6 Peptide Libraries Comprising a Plurality of Individual Library Members

The invention also provides peptide libraries comprising a plurality of individual library members of the invention, wherein (1) each individual library member of said plurality comprises a sequence produced by shuffling of a pool of selected sequences, and (2) each individual library member comprises a variable peptide segment sequence or single-chain antibody segment sequence which is distinct from the variable peptide segment sequences or single-chain antibody sequences of other individual library members in said plurality (although some library members may be present in more than one copy per library due to uneven amplification, stochastic probability, or the like).

4.16.5.7.7 Product-by-Process

The invention also provides a product-by-process, wherein selected polynucleotide sequences having (or encoding a peptide having) a predetermined binding specificity are formed by the process of: (1) screening a displayed peptide or displayed single-chain antibody library against a predetermined receptor (e.g., ligand) or epitope (e.g., antigen macromolecule) and identifying and/or enriching library members which bind to the predetermined receptor or epitope to produce a pool of selected library members, (2) shuffling by recombination the selected library members (or amplified or cloned copies thereof) which bind the predetermined epitope and have been thereby isolated and/or enriched from the library to generate a shuffled library, and (3) screening the shuffled library against the predetermined receptor (e.g., ligand) or epitope (e.g., antigen macromolecule) and identifying and/or enriching shuffled library members which bind to the predetermined receptor or epitope to produce a pool of selected shuffled library members.

4.16.5.8 Antibody Display and Screening Methods

The present method can be used to shuffle, by in vitro and/or in vivo recombination by any of the disclosed methods, and in any combination, polynucleotide sequences selected by antibody display methods, wherein an associated polynucleotide encodes a displayed antibody which is screened for a phenotype (e.g., for affinity for binding a predetermined antigen (ligand)).

Various molecular genetic approaches have been devised to capture the vast immunological repertoire represented by the extremely large number of distinct variable regions which can be present in immunoglobulin chains. The naturally-occurring germ line immunoglobulin heavy chain locus is composed of separate tandem arrays of variable segment genes located upstream of a tandem array of diversity segment genes, which are themselves located upstream of a tandem array of joining (J) region genes, which are located upstream of the constant region genes. During B lymphocyte development, V-D-J rearrangement occurs wherein a heavy chain variable region gene (VH) is formed by rearrangement to form a fused D-J segment followed by rearrangement with a V segment to form a V-D-J joined product gene which, if productively rearranged, encodes a functional variable region (VH) of a heavy chain. Similarly, light chain loci rearrange one of several V segments with one of several J segments to form a gene encoding the variable region (VL) of a light chain.

4.16.5.8.1 Sequence Diversity

The vast repertoire of variable regions possible in immunoglobulins derives in part from the numerous combinatorial possibilities of joining V and J segments (and, in the case of heavy chain loci, D segments) during rearrangement in B cell development. Additional sequence diversity in the heavy chain variable regions arises from non-uniform rearrangements of the D segments during V-D-J joining and from N region addition. Further, antigen-selection of specific B cell clones selects for higher affinity variants having non-germline mutations in one or both of the heavy and light chain variable regions, a phenomenon referred to as “affinity maturation” or “affinity sharpening”. Typically, these “affinity sharpening” mutations cluster in specific areas of the variable region, most commonly in the complementarity-determining regions (CDRs).

4.16.5.8.2 Prokaryotic Expression Systems

In order to overcome many of the limitations in producing and identifying high-affinity immunoglobulins through antigen-stimulated B cell development (i.e., immunization), various prokaryotic expression systems have been developed that can be manipulated to produce combinatorial antibody libraries which may be screened for high-affinity antibodies to specific antigens. Recent advances in the expression of antibodies in Escherichia coli and bacteriophage systems (see, “Alternative Peptide Display Methods”, infra) have raised the possibility that virtually any specificity can be obtained by either cloning antibody genes from characterized hybridomas or by de novo selection using antibody gene libraries (e.g., from Ig cDNA).

Combinatorial libraries of antibodies have been generated in bacteriophage lambda expression systems which may be screened as bacteriophage plaques or as colonies of lysogens (Huse et al. (1989) Science 246: 1275; Caton and Koprowski (1990) Proc. Natl. Acad. Sci. (U.S.A.) 87: 6450; Mullinax et al. (1990) Proc. Natl. Acad. Sci. (U.S.A.) 87: 8095; Persson et al. (1991) Proc. Natl. Acad. Sci. (U.S.A.) 88: 2432). Various embodiments of bacteriophage antibody display libraries and lambda phage expression libraries have been described (Kang et al. (1991) Proc. Natl. Acad. Sci. (U.S.A.) 88: 4363; Clackson et al. (1991) Nature 352: 624; McCafferty et al. (1990) Nature 348: 552; Burton et al. (1991) Proc. Natl. Acad. Sci. (U.S.A.) 88: 10134; Hoogenboom et al. (1991) Nucleic Acids Res. 19: 4133; Chang et al. (1991) J. Immunol. 147: 3610; Breitling et al. (1991) Gene 104: 147; Marks et al. (1991) J. Mol. Biol. 222: 581; Barbas et al. (1992) Proc. Natl. Acad. Sci. (U.S.A.) 89: 4457; Hawkins and Winter (1992) J. Immunol. 22: 867; Marks et al. (1992) Biotechnology 10: 779; Marks et al. (1992) J. Biol. Chem. 267: 16007; Lowman et al. (1991) Biochemistry 30: 10832; Lerner et al. (1992) Science 258: 1313, incorporated herein by reference). Typically, a bacteriophage antibody display library is screened with a receptor (e.g., polypeptide, carbohydrate, glycoprotein, nucleic acid) that is immobilized (e.g., by covalent linkage to a chromatography resin to enrich for reactive phage by affinity chromatography) and/or labeled (e.g., to screen plaque or colony lifts).

4.16.5.8.3 Single-Chain Fragment Variable Libraries

One particularly advantageous approach has been the use of so-called single-chain fragment variable (scFv) libraries (Marks et al. (1992) Biotechnology 10: 779; Winter G and Milstein C (1991) Nature 349: 293; Clackson et al. (1991) op. cit.; Marks et al. (1991) J. Mol. Biol. 222: 581; Chaudhary et al. (1990) Proc. Natl. Acad. Sci. (USA) 87: 1066; Chiswell et al. (1992) TIBTECH 10: 80; McCafferty et al. (1990) op. cit.; and Huston et al. (1988) Proc. Natl. Acad. Sci. (USA) 85: 5879). Various embodiments of scFv libraries displayed on bacteriophage coat proteins have been described.

Beginning in 1988, single-chain analogues of Fv fragments and their fusion proteins have been reliably generated by antibody engineering methods. The first step generally involves obtaining the genes encoding VH and VL domains with desired binding properties; these V genes may be isolated from a specific hybridoma cell line, selected from a combinatorial V-gene library, or made by V gene synthesis. The single-chain Fv is formed by connecting the component V genes with an oligonucleotide that encodes an appropriately designed linker peptide, such as (Gly-Gly-Gly-Gly-Ser)3 or equivalent linker peptide(s). The linker bridges the C-terminus of the first V region and N-terminus of the second, ordered as either VH-linker-VL or VL-linker-VH. In principle, the scFv binding site can faithfully replicate both the affinity and specificity of its parent antibody combining site.
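The domain ordering described above can be illustrated at the amino acid level. The following is a minimal sketch: the function name is hypothetical, the VH and VL strings are short placeholders rather than complete variable domains, and only the (Gly-Gly-Gly-Gly-Ser)3 linker is taken from the text.

```python
LINKER_UNIT = "GGGGS"  # one-letter code for Gly-Gly-Gly-Gly-Ser


def make_scfv(vh, vl, repeats=3, order="VH-VL"):
    """Join two variable-domain sequences with a (GGGGS)n linker.

    The linker bridges the C-terminus of the first domain and the
    N-terminus of the second, in either of the two possible orders.
    """
    linker = LINKER_UNIT * repeats
    if order == "VH-VL":
        return vh + linker + vl
    if order == "VL-VH":
        return vl + linker + vh
    raise ValueError("order must be 'VH-VL' or 'VL-VH'")


# Placeholder fragments, used only to show the layout.
vh_fragment = "EVQLVESGG"
vl_fragment = "DIQMTQSPS"
print(make_scfv(vh_fragment, vl_fragment))
print(make_scfv(vh_fragment, vl_fragment, order="VL-VH"))
```

In practice the linker is encoded by an oligonucleotide joining the two V genes; the sketch shows only the resulting polypeptide order.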

Thus, scFv fragments are composed of VH and VL domains linked into a single polypeptide chain by a flexible linker peptide. After the scFv genes are assembled, they are cloned into a phagemid and expressed at the tip of the M13 phage (or similar filamentous bacteriophage) as fusion proteins with the bacteriophage pIII (gene 3) coat protein. Enriching for phage expressing an antibody of interest is accomplished by panning the recombinant phage displaying a population of scFv for binding to a predetermined epitope (e.g., target antigen, receptor).

The linked polynucleotide of a library member provides the basis for replication of the library member after a screening or selection procedure, and also provides the basis for the determination, by nucleotide sequencing, of the identity of the displayed peptide sequence or VH and VL amino acid sequence. The displayed peptide(s) or single-chain antibody (e.g., scFv) and/or its VH and VL domains or their CDRs can be cloned and expressed in a suitable expression system. Often polynucleotides encoding the isolated VH and VL domains will be ligated to polynucleotides encoding constant regions (CH and CL) to form polynucleotides encoding complete antibodies (e.g., chimeric or fully-human), antibody fragments, and the like. Often polynucleotides encoding the isolated CDRs will be grafted into polynucleotides encoding a suitable variable region framework (and optionally constant regions) to form polynucleotides encoding complete antibodies (e.g., humanized or fully-human), antibody fragments, and the like. Antibodies can be used to isolate preparative quantities of the antigen by immunoaffinity chromatography. Various other uses of such antibodies are to diagnose and/or stage disease (e.g., neoplasia) and for therapeutic application to treat disease, such as for example: neoplasia, autoimmune disease, AIDS, cardiovascular disease, infections, and the like.

4.16.5.8.4 Increasing the Combinatorial Diversity of a scFv Library

Various methods have been reported for increasing the combinatorial diversity of a scFv library to broaden the repertoire of binding species (idiotype spectrum). The use of PCR has permitted the variable regions to be rapidly cloned either from a specific hybridoma source or as a gene library from non-immunized cells, affording combinatorial diversity in the assortment of VH and VL cassettes which can be combined. Furthermore, the VH and VL cassettes can themselves be diversified, such as by random, pseudorandom, or directed mutagenesis. Typically, VH and VL cassettes are diversified in or near the complementarity-determining regions (CDRs), often the third CDR, CDR3. Enzymatic inverse PCR mutagenesis has been shown to be a simple and reliable method for constructing relatively large libraries of scFv site-directed hybrids (Stemmer et al. (1993) Biotechniques 14: 256), as has error-prone PCR and chemical mutagenesis (Deng et al. (1994) J. Biol. Chem. 269: 9533). Riechmann et al. (1993) Biochemistry 32: 8848 showed semi-rational design of an antibody scFv fragment using site-directed randomization by degenerate oligonucleotide PCR and subsequent phage display of the resultant scFv hybrids. Barbas et al. (1992) op. cit. attempted to circumvent the problem of limited repertoire sizes resulting from using biased variable region sequences by randomizing the sequence in a synthetic CDR region of a human tetanus toxoid-binding Fab.

CDR randomization has the potential to create approximately 1×10²⁰ CDRs for the heavy chain CDR3 alone, and a roughly similar number of variants of the heavy chain CDR1 and CDR2, and light chain CDR1-3 variants. Taken individually or together, the combination possibilities of CDR randomization of heavy and/or light chains require generating a prohibitive number of bacteriophage clones to produce a clone library representing all possible combinations, the vast majority of which will be non-binding. Generation of such large numbers of primary transformants is not feasible with current transformation technology and bacteriophage display systems. For example, Barbas et al. (1992) op. cit. only generated 5×10⁷ transformants, which represents only a tiny fraction of the potential diversity of a library of thoroughly randomized CDRs.
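The scale of this gap can be made concrete with a short calculation. The sketch below assumes that each randomized position may hold any of the 20 standard amino acids, and the choice of 15 randomized positions is an illustrative assumption (20¹⁵ is about 3×10¹⁹, the same order of magnitude as the figure quoted above); only the 5×10⁷ transformant count and the 1×10²⁰ estimate come from the text.

```python
AMINO_ACIDS = 20  # standard residues assumed possible at each position


def library_coverage(randomized_positions, transformants):
    """Return (theoretical diversity, fraction covered by the library).

    The fraction is an upper bound: it assumes every transformant
    carries a distinct sequence.
    """
    theoretical = AMINO_ACIDS ** randomized_positions
    return theoretical, transformants / theoretical


theoretical, covered = library_coverage(15, 5e7)
print(f"theoretical diversity: {theoretical:.3e}")
print(f"fraction covered:      {covered:.3e}")

# Using the figures quoted in the text directly:
print(f"5e7 / 1e20 = {5e7 / 1e20:.1e}")
```

Even under the optimistic assumption that no sequence is sampled twice, a library of this size covers well under one part in a billion of a single fully randomized CDR3.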

Despite these substantial limitations, bacteriophage display of scFv has already yielded a variety of useful antibodies and antibody fusion proteins. A bispecific single chain antibody has been shown to mediate efficient tumor cell lysis (Gruber et al. (1994) J. Immunol. 152: 5368). Intracellular expression of an anti-Rev scFv has been shown to inhibit HIV-1 virus replication in vitro (Duan et al. (1994) Proc. Natl. Acad. Sci. (USA) 91: 5075), and intracellular expression of an anti-p21ras scFv has been shown to inhibit meiotic maturation of Xenopus oocytes (Biocca et al. (1993) Biochem. Biophys. Res. Commun. 197: 422). Recombinant scFv which can be used to diagnose HIV infection have also been reported, demonstrating the diagnostic utility of scFv (Lilley et al. (1994) J. Immunol. Meth. 171: 211). Fusion proteins wherein an scFv is linked to a second polypeptide, such as a toxin or fibrinolytic activator protein, have also been reported (Holvoet et al. (1992) Eur. J. Biochem. 210: 945; Nicholls et al. (1993) J. Biol. Chem. 268: 5302).

4.16.5.8.5 Use of in vitro and in vivo Shuffling Methods to Recombine CDRs

If it were possible to generate scFv libraries having broader antibody diversity and overcoming many of the limitations of conventional CDR mutagenesis and randomization methods, which can cover only a very tiny fraction of the potential sequence combinations, the number and quality of scFv antibodies suitable for therapeutic and diagnostic use could be vastly improved. To address this, the in vitro and in vivo shuffling methods of the invention are used to recombine CDRs which have been obtained (typically via PCR amplification or cloning) from nucleic acids obtained from selected displayed antibodies. Such displayed antibodies can be displayed on cells, on bacteriophage particles, on polysomes, or any suitable antibody display system wherein the antibody is associated with its encoding nucleic acid(s). In a variation, the CDRs are initially obtained from mRNA (or cDNA) from antibody-producing cells (e.g., plasma cells/splenocytes from an immunized wild-type mouse, a human, or a transgenic mouse capable of making a human antibody as in WO92/03918, WO93/12227, and WO94/25585), including hybridomas derived therefrom.

Polynucleotide sequences selected in a first selection round (typically by affinity selection for displayed antibody binding to an antigen (e.g., a ligand)) by any of these methods are pooled and the pool(s) is/are shuffled by in vitro and/or in vivo recombination, especially shuffling of CDRs (typically shuffling heavy chain CDRs with other heavy chain CDRs and light chain CDRs with other light chain CDRs) to produce a shuffled pool comprising a population of recombined selected polynucleotide sequences. The recombined selected polynucleotide sequences are expressed in a selection format as a displayed antibody and subjected to at least one subsequent selection round. The polynucleotide sequences selected in the subsequent selection round(s) can be used directly, sequenced, and/or subjected to one or more additional rounds of shuffling and subsequent selection until an antibody of the desired binding affinity is obtained. Selected sequences can also be back-crossed with polynucleotide sequences encoding neutral antibody framework sequences (i.e., having insubstantial functional effect on antigen binding), such as for example by back-crossing with a human variable region framework to produce human-like sequence antibodies. Generally, during back-crossing subsequent selection is applied to retain the property of binding to the predetermined antigen.
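The combinatorial reach of recombining CDRs among selected clones of one chain type can be sketched as follows. This is an idealized illustration only: it enumerates every combination exhaustively, whereas a physical shuffling reaction samples combinations stochastically, and the function name and CDR labels are hypothetical placeholders.

```python
from itertools import product


def recombine_cdrs(clones):
    """Enumerate all CDR combinations obtainable from the given clones.

    Each clone is a tuple of CDR sequences for one chain type, e.g.
    (CDR1, CDR2, CDR3). CDRs are exchanged only position for position
    (CDR1 with CDR1, and so on), mirroring heavy-with-heavy and
    light-with-light shuffling.
    """
    n_cdrs = len(clones[0])
    pools = [sorted({clone[i] for clone in clones}) for i in range(n_cdrs)]
    return list(product(*pools))


# Three hypothetical heavy-chain clones recovered from a first selection round.
selected_heavy = [
    ("H1a", "H2a", "H3a"),
    ("H1b", "H2b", "H3b"),
    ("H1c", "H2c", "H3c"),
]
recombinants = recombine_cdrs(selected_heavy)
print(len(recombinants), "possible heavy-chain CDR combinations")
```

Three parental clones with distinct CDRs at each of three positions yield 3 × 3 × 3 = 27 combinations, of which 24 are novel; the parental combinations remain in the pool and compete in the subsequent selection round.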

4.16.5.8.6 Controlling the Average Binding Affinity of Selected scFv Library Members

Alternatively, or in combination with the noted variations, the valency of the target epitope may be varied to control the average binding affinity of selected scFv library members. The target epitope can be bound to a surface or substrate at varying densities, such as by including a competitor epitope, by dilution, or by other methods known to those in the art. A high density (valency) of predetermined epitope can be used to enrich for scFv library members which have relatively low affinity, whereas a low density (valency) can preferentially enrich for higher affinity scFv library members.

4.16.5.8.7 Generating Diverse Variable Segments

For generating diverse variable segments, a collection of synthetic oligonucleotides encoding random, pseudorandom, or a defined sequence kernel set of peptide sequences can be inserted by ligation into a predetermined site (e.g., a CDR). Similarly, the sequence diversity of one or more CDRs of the single-chain antibody cassette(s) can be expanded by mutating the CDR(s) with site-directed mutagenesis, CDR-replacement, and the like. The resultant DNA molecules can be propagated in a host for cloning and amplification prior to shuffling, or can be used directly (i.e., may avoid loss of diversity which may occur upon propagation in a host cell) and the selected library members subsequently shuffled.

Displayed peptide/polynucleotide complexes (library members) which encode a variable segment peptide sequence of interest or a single-chain antibody of interest are selected from the library by an affinity enrichment technique. This is accomplished by means of an immobilized macromolecule or epitope specific for the peptide sequence of interest, such as a receptor, other macromolecule, or other epitope species. Repeating the affinity selection procedure provides an enrichment of library members encoding the desired sequences, which may then be isolated for pooling and shuffling, for sequencing, and/or for further propagation and affinity enrichment.

The library members without the desired specificity are removed by washing. The degree and stringency of washing required will be determined for each peptide sequence or single-chain antibody of interest and the immobilized predetermined macromolecule or epitope. A certain degree of control can be exerted over the binding characteristics of the nascent peptide/DNA complexes recovered by adjusting the conditions of the binding incubation and the subsequent washing. The temperature, pH, ionic strength, divalent cation concentration, and the volume and duration of the washing will select for nascent peptide/DNA complexes within particular ranges of affinity for the immobilized macromolecule. Selection based on slow dissociation rate, which is usually predictive of high affinity, is often the most practical route. This may be done either by continued incubation in the presence of a saturating amount of free predetermined macromolecule, or by increasing the volume, number, and length of the washes. In each case, the rebinding of dissociated nascent peptide/DNA or peptide/RNA complex is prevented, and with increasing time, nascent peptide/DNA or peptide/RNA complexes of higher and higher affinity are recovered.
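The reasoning behind dissociation-rate selection can be sketched with a simple first-order model. The sketch assumes that rebinding is fully prevented, so that the fraction of a complex still bound after time t is exp(-k_off · t); the two dissociation rate constants and the wash durations are hypothetical values chosen for illustration.

```python
import math


def fraction_bound(k_off, seconds):
    """Fraction of complexes still bound after a wash of given duration.

    Assumes first-order dissociation with no rebinding.
    k_off is the dissociation rate constant in 1/s.
    """
    return math.exp(-k_off * seconds)


# Two hypothetical library members differing 100-fold in off-rate.
fast_k_off = 1e-2  # 1/s, weaker binder
slow_k_off = 1e-4  # 1/s, tighter binder

for minutes in (1, 10, 30):
    t = minutes * 60
    slow = fraction_bound(slow_k_off, t)
    fast = fraction_bound(fast_k_off, t)
    print(f"{minutes:>2} min wash: slow={slow:.3f} fast={fast:.2e} "
          f"enrichment={slow / fast:.2e}")
```

Under this model the enrichment of the slowly dissociating member over the rapidly dissociating one grows exponentially with wash time, consistent with the recovery of complexes of progressively higher affinity as washing is extended.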

Additional modifications of the binding and washing procedures may be applied to find peptides with special characteristics. The affinities of some peptides are dependent on ionic strength or cation concentration. This is a useful characteristic for peptides that will be used in affinity purification of various proteins when gentle conditions for removing the protein from the peptides are required.

One variation involves the use of multiple binding targets (multiple epitope species, multiple receptor species), such that a scFv library can be simultaneously screened for a multiplicity of scFv which have different binding specificities. Given that the size of a scFv library often limits the diversity of potential scFv sequences, it is typically desirable to use scFv libraries of as large a size as possible. The time and economic considerations of generating a number of very large polysome scFv-display libraries can become prohibitive. To avoid this substantial problem, multiple predetermined epitope species (receptor species) can be concomitantly screened in a single library, or sequential screening against a number of epitope species can be used. In one variation, multiple target epitope species, each encoded on a separate bead (or subset of beads), can be mixed and incubated with a polysome-display scFv library under suitable binding conditions. The collection of beads, comprising multiple epitope species, can then be used to isolate, by affinity selection, scFv library members. Generally, subsequent affinity screening rounds can include the same mixture of beads, subsets thereof, or beads containing only one or two individual epitope species. This approach affords efficient screening, and is compatible with laboratory automation, batch processing, and high throughput screening methods.

4.16.5.8.8 Techniques Used to Diversify a Peptide Library orSingle-Chain Antibody Library

A variety of techniques can be used in the present invention to diversify a peptide library or single-chain antibody library, or to diversify, prior to or concomitant with shuffling, around variable segment peptides found in early rounds of panning to have sufficient binding activity to the predetermined macromolecule or epitope. In one approach, the positive selected peptide/polynucleotide complexes (those identified in an early round of affinity enrichment) are sequenced to determine the identity of the active peptides. Oligonucleotides are then synthesized based on these active peptide sequences, employing a low level of all bases incorporated at each step to produce slight variations of the primary oligonucleotide sequences. This mixture of (slightly) degenerate oligonucleotides is then cloned into the variable segment sequences at the appropriate locations. This method produces systematic, controlled variations of the starting peptide sequences, which can then be shuffled. It requires, however, that individual positive nascent peptide/polynucleotide complexes be sequenced before mutagenesis, and thus is useful for expanding the diversity of small numbers of recovered complexes and selecting variants having higher binding affinity and/or higher binding specificity. In a variation, mutagenic PCR amplification of positive selected peptide/polynucleotide complexes (especially of the variable region sequences) is performed, the amplification products are shuffled in vitro and/or in vivo, and one or more additional rounds of screening are done prior to sequencing. The same general approach can be employed with single-chain antibodies in order to expand the diversity and enhance the binding affinity/specificity, typically by diversifying CDRs or adjacent framework regions prior to or concomitant with shuffling. If desired, shuffling reactions can be spiked with mutagenic oligonucleotides capable of in vitro recombination with the selected library members. Thus, mixtures of synthetic oligonucleotides and PCR-produced polynucleotides (synthesized by error-prone or high-fidelity methods) can be added to the in vitro shuffling mix and be incorporated into resulting shuffled library members (shufflants).
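The doped-oligonucleotide approach described above can be pictured with a short simulation. The following is a minimal sketch only: it assumes that "a low level of all bases incorporated at each step" can be modeled as an independent per-position probability of substitution by one of the other three bases. The 3 percent default doping level, the function names, and the example sequence are illustrative assumptions and are not taken from this disclosure.

```python
import random

BASES = "ACGT"

def doped_oligo(seq, dope=0.03, rng=None):
    """Return one variant of seq. Each position is replaced by a different
    base with probability `dope` (an assumed, illustrative doping level)."""
    rng = rng or random.Random()
    out = []
    for base in seq:
        if rng.random() < dope:
            # misincorporation: choose one of the three other bases
            out.append(rng.choice([b for b in BASES if b != base]))
        else:
            out.append(base)
    return "".join(out)

def doped_pool(seq, n, dope=0.03, seed=0):
    """Build a pool of n slightly degenerate oligonucleotides."""
    rng = random.Random(seed)
    return [doped_oligo(seq, dope, rng) for _ in range(n)]

def hamming(a, b):
    """Number of positions at which two equal-length sequences differ."""
    return sum(x != y for x, y in zip(a, b))

if __name__ == "__main__":
    parent = "GGTTTCACCTACTGG"  # hypothetical active-peptide coding sequence
    for variant in doped_pool(parent, 5, dope=0.1, seed=1):
        print(variant, hamming(parent, variant))
```

Raising `dope` widens the sequence space explored around the parent, while lowering it keeps most pool members close to the sequenced active peptide.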

4.16.5.8.9 Generation of a Library of CDR-Variant Single-Chain Antibodies

The present invention of shuffling enables the generation of a vast library of CDR-variant single-chain antibodies. One way to generate such antibodies is to insert synthetic CDRs into the single-chain antibody and/or to randomize CDRs prior to or concomitant with shuffling. The sequences of the synthetic CDR cassettes are selected by referring to known sequence data of human CDRs, at the discretion of the practitioner, according to the following guidelines: synthetic CDRs will have at least 40 percent positional sequence identity to known CDR sequences, and preferably will have at least 50 to 70 percent positional sequence identity to known CDR sequences. For example, a collection of synthetic CDR sequences can be generated by synthesizing a collection of oligonucleotide sequences on the basis of naturally-occurring human CDR sequences listed in Kabat et al. (1991) op cit.; the pool(s) of synthetic CDR sequences are calculated to encode CDR peptide sequences having at least 40 percent sequence identity to at least one known naturally-occurring human CDR sequence. Alternatively, a collection of naturally-occurring CDR sequences may be compared to generate consensus sequences so that amino acids used at a residue position frequently (i.e., in at least 5 percent of known CDR sequences) are incorporated into the synthetic CDRs at the corresponding position(s). Typically, several (e.g., 3 to about 50) known CDR sequences are compared and observed natural sequence variations between the known CDRs are tabulated, and a collection of oligonucleotides encoding CDR peptide sequences encompassing all or most permutations of the observed natural sequence variations is synthesized. For example but not for limitation, if a collection of human VH CDR sequences have carboxy-terminal amino acids which are either Tyr, Val, Phe, or Asp, then the pool(s) of synthetic CDR oligonucleotide sequences are designed to allow the carboxy-terminal CDR residue to be any of these amino acids. In some embodiments, residues other than those which naturally occur at a residue position in the collection of CDR sequences are incorporated: conservative amino acid substitutions are frequently incorporated and up to 5 residue positions may be varied to incorporate non-conservative amino acid substitutions as compared to known naturally-occurring CDR sequences. Such CDR sequences can be used in primary library members (prior to first round screening) and/or can be used to spike in vitro shuffling reactions of selected library member sequences. Construction of such pools of defined and/or degenerate sequences will be readily accomplished by those of ordinary skill in the art.
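The tabulation described above (compare known CDRs, record which residues occur at each position in at least 5 percent of sequences, then enumerate the permutations) can be sketched as follows. This is an illustration under stated assumptions only: the four-residue sequences are invented placeholders rather than real CDRs, the input is assumed to be pre-aligned and of equal length, and the function names are hypothetical.

```python
from collections import Counter
from itertools import product

def residues_by_position(cdrs, min_freq=0.05):
    """For each aligned position, list residues seen in >= min_freq of CDRs."""
    length = len(cdrs[0])
    if any(len(c) != length for c in cdrs):
        raise ValueError("CDR sequences must be aligned to equal length")
    allowed = []
    for i in range(length):
        counts = Counter(c[i] for c in cdrs)
        allowed.append(sorted(aa for aa, n in counts.items()
                              if n / len(cdrs) >= min_freq))
    return allowed

def enumerate_variants(allowed):
    """All permutations of the tabulated per-position residues."""
    return ["".join(p) for p in product(*allowed)]

def positional_identity(a, b):
    """Fraction of aligned positions at which two sequences agree."""
    return sum(x == y for x, y in zip(a, b)) / len(a)

def filter_by_identity(variants, known, threshold=0.40):
    """Keep variants with >= threshold identity to at least one known CDR."""
    return [v for v in variants
            if max(positional_identity(v, k) for k in known) >= threshold]

if __name__ == "__main__":
    # Invented sequences; the carboxy-terminal residue is Tyr, Val, Phe or Asp.
    known = ["GFTY", "GFSV", "GYTF", "GFTD"]
    variants = enumerate_variants(residues_by_position(known))
    print(len(variants), "candidate synthetic CDRs")
    print(len(filter_by_identity(variants, known)), "meet the 40 percent guideline")
```

Because the number of permutations grows multiplicatively with the number of variable positions, a practitioner would typically check the enumerated count against the library-size limits before synthesis.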

The collection of synthetic CDR sequences comprises at least one member that is not known to be a naturally-occurring CDR sequence. It is within the discretion of the practitioner to include or not include a portion of random or pseudorandom sequence corresponding to N region addition in the heavy chain CDR; the N region sequence ranges from 1 nucleotide to about 4 nucleotides occurring at V-D and D-J junctions. A collection of synthetic heavy chain CDR sequences comprises at least about 100 unique CDR sequences, typically at least about 1,000 unique CDR sequences, preferably at least about 10,000 unique CDR sequences, frequently more than 50,000 unique CDR sequences; however, usually not more than about 1×10⁶ unique CDR sequences are included in the collection, although occasionally 1×10⁷ to 1×10⁸ unique CDR sequences are present, especially if conservative amino acid substitutions are permitted at positions where the conservative amino acid substituent is not present or is rare (i.e., less than 0.1 percent) in that position in naturally-occurring human CDRs. In general, the number of unique CDR sequences included in a library should not exceed the expected number of primary transformants in the library by more than a factor of 10. Such single-chain antibodies generally bind the predetermined antigen with an affinity of about at least 1×10⁷ M⁻¹, preferably with an affinity of about at least 5×10⁷ M⁻¹, more preferably with an affinity of at least 1×10⁸ M⁻¹ to 1×10⁹ M⁻¹ or more, sometimes up to 1×10¹⁰ M⁻¹ or more. Frequently, the predetermined antigen is a human protein, such as for example a human cell surface antigen (e.g., CD4, CD8, IL-2 receptor, EGF receptor, PDGF receptor), other human biological macromolecule (e.g., thrombomodulin, protein C, carbohydrate antigen, sialyl Lewis antigen, L-selectin), or nonhuman disease associated macromolecule (e.g., bacterial LPS, virion capsid protein or envelope glycoprotein) and the like.

4.16.5.8.10 Expression Systems

High affinity single-chain antibodies of the desired specificity can be engineered and expressed in a variety of systems. For example, scFv have been produced in plants (Firek et al. (1993) Plant Mol. Biol. 23: 861) and can be readily made in prokaryotic systems (Owens R J and Young R J (1994) J. Immunol. Meth. 168: 149; Johnson S and Bird R E (1991) Methods Enzymol. 203: 88). Furthermore, the single-chain antibodies can be used as a basis for constructing whole antibodies or various fragments thereof (Kettleborough et al. (1994) Eur. J. Immunol. 24: 952). The variable region encoding sequence may be isolated (e.g., by PCR amplification or subcloning) and spliced to a sequence encoding a desired human constant region to encode a human sequence antibody more suitable for human therapeutic uses where immunogenicity is preferably minimized. The polynucleotide(s) having the resultant fully human encoding sequence(s) can be expressed in a host cell (e.g., from an expression vector in a mammalian cell) and purified for pharmaceutical formulation.

The DNA expression constructs will typically include an expression control DNA sequence operably linked to the coding sequences, including naturally-associated or heterologous promoter regions. Preferably, the expression control sequences will be eukaryotic promoter systems in vectors capable of transforming or transfecting eukaryotic host cells. Once the vector has been incorporated into the appropriate host, the host is maintained under conditions suitable for high level expression of the nucleotide sequences, and the collection and purification of the mutant “engineered” antibodies.

As stated previously, the DNA sequences will be expressed in hosts after the sequences have been operably linked to an expression control sequence (i.e., positioned to ensure the transcription and translation of the structural gene). These expression vectors are typically replicable in the host organisms either as episomes or as an integral part of the host chromosomal DNA. Commonly, expression vectors will contain selection markers, e.g., tetracycline or neomycin, to permit detection of those cells transformed with the desired DNA sequences (see, e.g., U.S. Pat. No. 4,704,362, which is incorporated herein by reference).

4.16.5.8.11 Mammalian Tissue Cell Culture

In addition to eukaryotic microorganisms such as yeast, mammalian tissue cell culture may also be used to produce the polypeptides of the present invention (see, Winnacker, “From Genes to Clones,” VCH Publishers, N.Y., N.Y. (1987), which is incorporated herein by reference). Eukaryotic cells are actually preferred, because a number of suitable host cell lines capable of secreting intact immunoglobulins have been developed in the art, and include the CHO cell lines, various COS cell lines, HeLa cells, and myeloma cell lines, but preferably transformed B-cells or hybridomas. Expression vectors for these cells can include expression control sequences, such as an origin of replication, a promoter, an enhancer (Queen et al. (1986) Immunol. Rev. 89: 49), and necessary processing information sites, such as ribosome binding sites, RNA splice sites, polyadenylation sites, and transcriptional terminator sequences. Preferred expression control sequences are promoters derived from immunoglobulin genes, cytomegalovirus, SV40, Adenovirus, Bovine Papilloma Virus, and the like.

Eukaryotic DNA transcription can be increased by inserting an enhancer sequence into the vector. Enhancers are cis-acting sequences of between 10 and 300 bp that increase transcription by a promoter. Enhancers can effectively increase transcription when either 5′ or 3′ to the transcription unit. They are also effective if located within an intron or within the coding sequence itself. Typically, viral enhancers are used, including SV40 enhancers, cytomegalovirus enhancers, polyoma enhancers, and adenovirus enhancers. Enhancer sequences from mammalian systems are also commonly used, such as the mouse immunoglobulin heavy chain enhancer.

Mammalian expression vector systems will also typically include a selectable marker gene. Examples of suitable markers include the dihydrofolate reductase gene (DHFR), the thymidine kinase gene (TK), or prokaryotic genes conferring drug resistance. The first two marker genes prefer the use of mutant cell lines that lack the ability to grow without the addition of thymidine to the growth medium. Transformed cells can then be identified by their ability to grow on non-supplemented media. Examples of prokaryotic drug resistance genes useful as markers include genes conferring resistance to G418, mycophenolic acid and hygromycin.

The vectors containing the DNA segments of interest can be transferred into the host cell by well-known methods, depending on the type of cellular host. For example, calcium chloride transfection is commonly utilized for prokaryotic cells, whereas calcium phosphate treatment, lipofection, or electroporation may be used for other cellular hosts. Other methods used to transform mammalian cells include the use of Polybrene, protoplast fusion, liposomes, electroporation, and micro-injection (see, generally, Sambrook et al., supra).

Once expressed, the antibodies, individual mutated immunoglobulin chains, mutated antibody fragments, and other immunoglobulin polypeptides of the invention can be purified according to standard procedures of the art, including ammonium sulfate precipitation, fraction column chromatography, gel electrophoresis and the like (see, generally, Scopes, R., Protein Purification, Springer-Verlag, N.Y. (1982)). Once purified, partially or to homogeneity as desired, the polypeptides may then be used therapeutically or in developing and performing assay procedures, immunofluorescent stainings, and the like (see, generally, Immunological Methods, Vols. I and II, Eds. Lefkovits and Pernis, Academic Press, New York, N.Y. (1979 and 1981)).

The antibodies generated by the method of the present invention can be used for diagnosis and therapy. By way of illustration and not limitation, they can be used to treat cancer, autoimmune diseases, or viral infections. For treatment of cancer, the antibodies will typically bind to an antigen expressed preferentially on cancer cells, such as erbB-2, CEA, CD33, and many other antigens and binding members well known to those skilled in the art.

4.16.5.9 Yeast Two Hybrid Screening Assays

Shuffling can also be used to recombinatorially diversify a pool of selected library members obtained by screening a two-hybrid screening system to identify library members which bind a predetermined polypeptide sequence. The selected library members are pooled and shuffled by in vitro and/or in vivo recombination. The shuffled pool can then be screened in a yeast two hybrid system to select library members which bind said predetermined polypeptide sequence (e.g., an SH2 domain) or which bind an alternate predetermined polypeptide sequence (e.g., an SH2 domain from another protein species).

An approach to identifying polypeptide sequences which bind to a predetermined polypeptide sequence has been to use a so-called “two-hybrid” system wherein the predetermined polypeptide sequence is present in a fusion protein (Chien et al. (1991) Proc. Natl. Acad. Sci. (USA) 88: 9578). This approach identifies protein-protein interactions in vivo through reconstitution of a transcriptional activator (Fields S and Song O (1989) Nature 340: 245), the yeast Gal4 transcription protein. Typically, the method is based on the properties of the yeast Gal4 protein, which consists of separable domains responsible for DNA-binding and transcriptional activation. Polynucleotides encoding two hybrid proteins, one consisting of the yeast Gal4 DNA-binding domain fused to a polypeptide sequence of a known protein and the other consisting of the Gal4 activation domain fused to a polypeptide sequence of a second protein, are constructed and introduced into a yeast host cell. Intermolecular binding between the two fusion proteins reconstitutes the Gal4 DNA-binding domain with the Gal4 activation domain, which leads to the transcriptional activation of a reporter gene (e.g., lacZ, HIS3) which is operably linked to a Gal4 binding site. Typically, the two-hybrid method is used to identify novel polypeptide sequences which interact with a known protein (Silver S C and Hunt S W (1993) Mol. Biol. Rep. 17: 155; Durfee et al. (1993) Genes Devel. 7: 555; Yang et al. (1992) Science 257: 680; Luban et al. (1993) Cell 73: 1067; Hardy et al. (1992) Genes Devel. 6: 801; Bartel et al. (1993) Biotechniques 14: 920; and Vojtek et al. (1993) Cell 74: 205). However, variations of the two-hybrid method have been used to identify mutations of a known protein that affect its binding to a second known protein (Li B and Fields S (1993) FASEB J. 7: 957; Lalo et al. (1993) Proc. Natl. Acad. Sci. (USA) 90: 5524; Jackson et al. (1993) Mol. Cell. Biol. 13: 2899; and Madura et al. (1993) J. Biol. Chem. 268: 12046). Two-hybrid systems have also been used to identify interacting structural domains of two known proteins (Bardwell et al. (1993) Mol. Microbiol. 8: 1177; Chakrabarty et al. (1992) J. Biol. Chem. 267: 17498; Staudinger et al. (1993) J. Biol. Chem. 268: 4608; and Milne G T and Weaver D T (1993) Genes Devel. 7: 1755) or domains responsible for oligomerization of a single protein (Iwabuchi et al. (1993) Oncogene 8: 1693; Bogerd et al. (1993) J. Virol. 67: 5030). Variations of two-hybrid systems have been used to study the in vivo activity of a proteolytic enzyme (Dasmahapatra et al. (1992) Proc. Natl. Acad. Sci. (USA) 89: 4159). Alternatively, an E. coli/BCCP interactive screening system (Germino et al. (1993) Proc. Natl. Acad. Sci. (U.S.A.) 90: 933; Guarente L (1993) Proc. Natl. Acad. Sci. (U.S.A.) 90: 1639) can be used to identify interacting protein sequences (i.e., protein sequences which heterodimerize or form higher order heteromultimers). Sequences selected by a two-hybrid system can be pooled and shuffled and introduced into a two-hybrid system for one or more subsequent rounds of screening to identify polypeptide sequences which bind to the hybrid containing the predetermined binding sequence. The sequences thus identified can be compared to identify consensus sequence(s) and consensus sequence kernels.

In general, standard techniques of recombinant DNA technology are described in various publications, e.g., Sambrook et al., 1989, Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Laboratory; Ausubel et al., 1987, Current Protocols in Molecular Biology, vols. 1 and 2 and supplements; and Berger and Kimmel, Methods in Enzymology, Volume 152, Guide to Molecular Cloning Techniques (1987), Academic Press, Inc., San Diego, Calif., each of which is incorporated herein in its entirety by reference. Polynucleotide modifying enzymes were used according to the manufacturers' recommendations. Oligonucleotides were synthesized on an Applied Biosystems Inc. Model 394 DNA synthesizer using ABI chemicals. If desired, PCR amplimers for amplifying a predetermined DNA sequence may be selected at the discretion of the practitioner.

4.16.5.9.1 Formation of Dimers

One microgram samples of template DNA are obtained and treated with U.V. light to cause the formation of dimers, including TT dimers, particularly pyrimidine dimers. U.V. exposure is limited so that only a few photoproducts are generated per gene on the template DNA sample. Multiple samples are treated with U.V. light for varying periods of time to obtain template DNA samples with varying numbers of dimers from U.V. exposure.
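The effect of varying the exposure time can be illustrated with a toy dose-response model. The sketch below is not a protocol from this disclosure: it assumes, as is standard photochemistry, that U.V. photoproducts form mainly at adjacent pyrimidines (TT, TC, CT, CC), and it assumes that each such site is hit independently with probability 1 − exp(−rate × time). The rate constant, the function names, and the example template are hypothetical, and overlapping sites are not excluded from one another.

```python
import math
import random

PYRIMIDINES = set("CT")

def dimer_sites(template):
    """Indices i where template[i] and template[i+1] are both pyrimidines."""
    return [i for i in range(len(template) - 1)
            if template[i] in PYRIMIDINES and template[i + 1] in PYRIMIDINES]

def irradiate(template, seconds, rate_per_site=0.001, rng=None):
    """Return the sites that acquire a photoproduct after `seconds` of U.V.

    rate_per_site is an assumed, illustrative constant (per site per second).
    """
    rng = rng or random.Random()
    p_hit = 1.0 - math.exp(-rate_per_site * seconds)
    return [i for i in dimer_sites(template) if rng.random() < p_hit]

if __name__ == "__main__":
    template = "ATGCTTGACCTTAGCTTCCGATTGCA"  # hypothetical template
    rng = random.Random(0)
    for seconds in (10, 60, 300):
        lesions = irradiate(template, seconds, rng=rng)
        print(seconds, "s:", len(lesions), "photoproducts at", lesions)
```

Running several exposure times in parallel, as in the loop above, mirrors the use of multiple samples to obtain templates carrying different numbers of dimers.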

4.16.5.9.2 Random Priming Kit

A random priming kit which utilizes a non-proofreading polymerase (for example, Prime-It II Random Primer Labeling kit by Stratagene Cloning Systems) is utilized to generate different size polynucleotides by priming at random sites on templates which are prepared by U.V. light (as described above) and extending along the templates. The priming protocols such as described in the Prime-It II Random Primer Labeling kit may be utilized to extend the primers. The dimers formed by U.V. exposure serve as a roadblock for the extension by the non-proofreading polymerase. Thus, a pool of random size polynucleotides is present after extension with the random primers is finished.

4.16.5.9.3 Generation of a Selected Mutant Polynucleotide Sequence

The present invention is further directed to a method for generating a selected mutant polynucleotide sequence (or a population of selected polynucleotide sequences) typically in the form of amplified and/or cloned polynucleotides, whereby the selected polynucleotide sequence(s) possess at least one desired phenotypic characteristic (e.g., encodes a polypeptide, promotes transcription of linked polynucleotides, binds a protein, and the like) which can be selected for. One method for identifying hybrid polypeptides that possess a desired structure or functional property, such as binding to a predetermined biological macromolecule (e.g., a receptor), involves the screening of a large library of polypeptides for individual library members which possess the desired structure or functional property conferred by the amino acid sequence of the polypeptide.

4.16.5.9.4 Generating Libraries Suitable for Affinity Interaction Screening or Phenotypic Screening

In one embodiment, the present invention provides a method for generating libraries of displayed polypeptides or displayed antibodies suitable for affinity interaction screening or phenotypic screening. The method comprises (1) obtaining a first plurality of selected library members comprising a displayed polypeptide or displayed antibody and an associated polynucleotide encoding said displayed polypeptide or displayed antibody, and obtaining said associated polynucleotides or copies thereof wherein said associated polynucleotides comprise a region of substantially identical sequences, optionally introducing mutations into said polynucleotides or copies, (2) pooling the polynucleotides or copies, (3) producing smaller or shorter polynucleotides by interrupting a random or particularized priming and synthesis process or an amplification process, and (4) performing amplification, preferably PCR amplification, and optionally mutagenesis to homologously recombine the newly synthesized polynucleotides.

4.16.5.9.5 Producing Hybrid Polynucleotides which Express a Useful Hybrid Polypeptide

It is a particularly preferred object of the invention to provide a process for producing hybrid polynucleotides which express a useful hybrid polypeptide by a series of steps comprising:

-   (a) producing polynucleotides by interrupting a polynucleotide amplification or synthesis process with a means for blocking or interrupting the amplification or synthesis process and thus providing a plurality of smaller or shorter polynucleotides due to the replication of the polynucleotide being in various stages of completion;
-   (b) adding to the resultant population of single- or double-stranded polynucleotides one or more single- or double-stranded oligonucleotides, wherein said added oligonucleotides comprise an area of identity in an area of heterology to one or more of the single- or double-stranded polynucleotides of the population;
-   (c) denaturing the resulting single- or double-stranded oligonucleotides to produce a mixture of single-stranded polynucleotides, optionally separating the shorter or smaller polynucleotides into pools of polynucleotides having various lengths and further optionally subjecting said polynucleotides to a PCR procedure to amplify one or more oligonucleotides comprised by at least one of said polynucleotide pools;
-   (d) incubating a plurality of said polynucleotides or at least one pool of said polynucleotides with a polymerase under conditions which result in annealing of said single-stranded polynucleotides at regions of identity between the single-stranded polynucleotides and thus forming of a mutagenized double-stranded polynucleotide chain;
-   (e) optionally repeating steps (c) and (d);
-   (f) expressing at least one hybrid polypeptide from said polynucleotide chain, or chains; and
-   (g) screening said at least one hybrid polypeptide for a useful activity.
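Steps (a), (c), (d) and (e) can be pictured with a toy model in which a growing strand is extended along one parental template, is interrupted after a short random distance (standing in for a blocking lesion or adduct), re-anneals to any parent that is identical to it over its last few bases, and is then extended again. The sketch below is an illustration under simplifying assumptions that are not from this disclosure: the parents are pre-aligned and of equal length, the `overlap` and `max_ext` parameters and the function name are hypothetical, and no enzymology or strand polarity is modeled.

```python
import random

def interrupted_extension_chimera(parents, overlap=4, max_ext=8, rng=None):
    """Assemble one full-length chimera by repeated interrupted extension.

    Each cycle: (1) find parents identical to the strand's last `overlap`
    bases at the aligned position (annealing at a region of identity),
    (2) pick one as template, (3) extend by 1..max_ext bases, then stop
    (the interruption). Repeat until the strand is full length.
    """
    rng = rng or random.Random()
    length = len(parents[0])
    if any(len(p) != length for p in parents):
        raise ValueError("parents must be aligned to equal length")
    # start from the first `overlap` bases of a randomly chosen parent
    strand = rng.choice(parents)[:overlap]
    while len(strand) < length:
        pos = len(strand)
        # The template used in the previous cycle always qualifies, so the
        # candidate list is never empty.
        candidates = [p for p in parents
                      if p[pos - overlap:pos] == strand[-overlap:]]
        template = rng.choice(candidates)
        ext = rng.randint(1, max_ext)           # interruption point
        strand += template[pos:pos + ext]       # extend along the template
    return strand

if __name__ == "__main__":
    # Hypothetical parents differing at a few positions (lower case = variant)
    p1 = "ATGGCTAGCTTGACCGATCGTACGTTAGCCATGA"
    p2 = "ATGGCTAGCTTGtCCGATCGTACGaTAGCCAcGA"
    rng = random.Random(2)
    for _ in range(4):
        print(interrupted_extension_chimera([p1, p2], rng=rng))
```

In this model a crossover can occur only where the parents share at least `overlap` identical bases, which is why shorter interruption distances and longer stretches of identity both increase the number of distinct hybrids recovered.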

In a preferred aspect of the invention, the means for blocking or interrupting the amplification or synthesis process is by utilization of U.V. light, DNA adducts, or DNA binding proteins.

In one embodiment of the invention, the DNA adducts, or polynucleotides comprising the DNA adducts, are removed from the polynucleotides or polynucleotide pool, such as by a process including heating the solution comprising the DNA fragments prior to further processing.

Having thus disclosed exemplary embodiments of the present invention, it should be noted by those skilled in the art that the disclosures are exemplary only and that various other alternatives, adaptations and modifications may be made within the scope of the present invention. Accordingly, the present invention is not limited to the specific embodiments as illustrated herein.

Without further elaboration, it is believed that one skilled in the art can, using the preceding description, utilize the present invention to its fullest extent. The following examples are to be considered illustrative and thus are not limiting of the remainder of the disclosure in any way whatsoever.

5. ENGINEERING GOALS

5.1. General Overview: Successive Cycles of Recombination and Screening/Selection

The invention provides methods for artificially evolving cells to acquire a new or improved property by recursive sequence recombination. Briefly, recursive sequence recombination entails successive cycles of recombination to generate molecular diversity and screening/selection to take advantage of that molecular diversity. That is, a family of nucleic acid molecules is created showing substantial sequence and/or structural identity but differing as to the presence of mutations. These sequences are then recombined in any of the described formats so as to optimize the diversity of mutant combinations represented in the resulting recombined library. Typically, any resulting recombinant nucleic acids or genomes are recursively recombined for one or more cycles of recombination to increase the diversity of resulting products. After this recursive recombination procedure, the final resulting products are screened and/or selected for a desired trait or property.
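The cycle of recombination followed by screening/selection can be sketched as a simple loop. This is a toy illustration only: sequences are short strings, recombination is modeled as a uniform crossover between two aligned parents, and the "screen" is a hypothetical score counting matches to an arbitrary target. Selection is applied after every cycle here, which is one of the formats contemplated in this section; the function names, parameters, and example sequences are assumptions and are not taken from this disclosure.

```python
import random

def recombine(a, b, rng):
    """Uniform crossover: each position is inherited from either parent."""
    return "".join(rng.choice(pair) for pair in zip(a, b))

def score(seq, target):
    """Hypothetical screen: number of positions matching the target."""
    return sum(x == y for x, y in zip(seq, target))

def recursive_recombination(population, target, cycles=5, keep=4,
                            offspring=20, seed=0):
    """Alternate recombination and selection for a fixed number of cycles.

    Parents are retained alongside their offspring before selection, so the
    best score in the population can never decrease between cycles.
    """
    rng = random.Random(seed)
    pop = list(population)
    for _ in range(cycles):
        children = []
        for _ in range(offspring):
            a, b = rng.sample(pop, 2)           # pick two library members
            children.append(recombine(a, b, rng))
        ranked = sorted(pop + children,
                        key=lambda s: score(s, target), reverse=True)
        pop = ranked[:keep]                      # selected members seed next cycle
    return pop

if __name__ == "__main__":
    # Four hypothetical mutants, each carrying one beneficial change (C)
    mutants = ["CAAAAAAA", "ACAAAAAA", "AACAAAAA", "AAACAAAA"]
    best = recursive_recombination(mutants, target="CCCCAAAA")
    print(best[0], score(best[0], "CCCCAAAA"))
```

The point of the sketch is that mutations arising in separate library members can be combined into one molecule without any new mutagenesis, which is the advantage that recombination offers over purely sequential mutation and screening.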

Alternatively, each recombination cycle can be followed by at least one cycle of screening or selection for molecules having a desired characteristic. In this embodiment, the molecule(s) selected in one round form the starting materials for generating diversity in the next round. The cells to be evolved can be bacteria, archaebacteria, or eukaryotic cells and can constitute a homogeneous cell line or mixed culture. Suitable cells for evolution include the bacterial and eukaryotic cell lines commonly used in genetic engineering, protein expression, or the industrial production or conversion of proteins, enzymes, primary metabolites, secondary metabolites, fine, specialty or commodity chemicals. Suitable mammalian cells include those from, e.g., mouse, rat, hamster, primate, and human, both cell lines and primary cultures. Such cells include stem cells, including embryonic stem cells and hemopoietic stem cells, zygotes, fibroblasts, lymphocytes, Chinese hamster ovary (CHO), mouse fibroblasts (NIH3T3), kidney, liver, muscle, and skin cells. Other eukaryotic cells of interest include plant cells, such as maize, rice, wheat, cotton, soybean, sugarcane, tobacco, and arabidopsis; fish, algae, fungi (penicillium, aspergillus, podospora, neurospora, saccharomyces), insect (e.g., baculo lepidoptera), yeast (pichia and saccharomyces, Schizosaccharomyces pombe). Also of interest are many bacterial cell types, both gram-negative and gram-positive, such as Bacillus subtilis, B. licheniformis, B. cereus, Escherichia coli, Streptomyces, Pseudomonas, Salmonella, Actinomycetes, Lactobacillus, Acelonitcbacter, Deinococcus, and Erwinia. The complete genome sequences of E. coli and Bacillus subtilis are described by Blattner et al., Science 277, 1454-1462 (1997); Kunst et al., Nature 390, 249-256 (1997).

5.2 Identification and Development of New and/or Improved Drugs

The genomics revolution, by determining the DNA sequences of great numbers of genes from many different organisms, has considerably broadened the possibilities for drug discovery by identifying large numbers of molecules that are potential targets of drug action. One area of drug development focuses upon generating new antimicrobial drugs. New antimicrobial drugs are needed to treat infections by drug resistant organisms, and new methods are urgently needed to facilitate making such discoveries. Technical advances in molecular biology, automated methods for high throughput screening and chemical syntheses have led to an increase in the number of target based screens utilized for antimicrobial drug discovery and in the number of compounds being analyzed.

The invention relates to procedures that can be applied to identifying compounds that bind to and modulate the function of target components of a cell whose function is known or unknown, and cell components that are not amenable to other screening methods. The invention relates to generating and/or identifying a compound that binds to and modulates (inhibits or enhances) the function of a component of a cell, thereby producing a phenotypic effect in the cell. Within these procedures are methods for identifying a biomolecule that 1) binds to, in vitro, a component of a cell that has been isolated from other constituents of the cell and that 2) causes, in vivo, as seen in an assay upon intracellular expression of the biomolecule, a phenotypic effect in the cell which is the usual producer and host of the target cell component. In an assay demonstrating characteristic 2) above, intracellular production of the biomolecule can be in cells grown in culture or in cells introduced into an animal. Further methods within these procedures are those methods comprising an assay for a phenotypic effect in the cell upon intracellular production of the biomolecule, either in cells in culture or in cells that have been introduced into one or more animals, and an assay to identify one or more compounds that behave as competitors of the biomolecule in an assay of binding to the target cell component.

5.2.1. Procedure for Identifying and/or Designing Compounds with Antimicrobial Activity Against a Pathogen

The invention further relates to methods particularly well suited to a procedure for identifying and/or designing compounds with antimicrobial activity against a pathogen whose target cell component is the subject of studies to identify such compounds. A common mechanism of action of an antimicrobial agent is binding to a component of the cells of the pathogen treated with the antimicrobial.

The procedure includes methods for identifying biomolecules that bind to a chosen target in vitro, methods for identifying biomolecules that also bind to the chosen target and modulate its function during infection of a host mammal in vivo, and methods for identifying compounds that compete with the biomolecules for sites on the target in competitive binding assays. Compounds identified by this procedure are candidates for drugs with antimicrobial activity against the pathogen.

5.3 Producing Proteins with Improved Affinities

Polynucleotide sequences selected in a first selection round (typically by affinity selection for binding to a receptor (e.g., a ligand)) by any of these methods are pooled and the pool(s) is/are shuffled by in vitro and/or in vivo recombination to produce a shuffled pool comprising a population of recombined selected polynucleotide sequences. The recombined selected polynucleotide sequences are subjected to at least one subsequent selection round. The polynucleotide sequences selected in the subsequent selection round(s) can be used directly, sequenced, and/or subjected to one or more additional rounds of shuffling and subsequent selection. Selected sequences can also be back-crossed with polynucleotide sequences encoding neutral sequences (i.e., having insubstantial functional effect on binding), such as for example by back-crossing with a wild-type or naturally-occurring sequence substantially identical to a selected sequence to produce native-like functional peptides, which may be less immunogenic. Generally, during back-crossing subsequent selection is applied to retain the property of binding to the predetermined receptor (ligand).

Prior to or concomitant with the shuffling of selected sequences, the sequences can be mutagenized. In one embodiment, selected library members are cloned in a prokaryotic vector (e.g., plasmid, phagemid, or bacteriophage) wherein a collection of individual colonies (or plaques) representing discrete library members is produced. Individual selected library members can then be manipulated (e.g., by site-directed mutagenesis, cassette mutagenesis, chemical mutagenesis, PCR mutagenesis, and the like) to generate a collection of library members representing a kernel of sequence diversity based on the sequence of the selected library member. The sequence of an individual selected library member or pool can be manipulated to incorporate random mutation, pseudorandom mutation, defined kernel mutation (i.e., comprising variant and invariant residue positions and/or comprising variant residue positions which can comprise a residue selected from a defined subset of amino acid residues), codon-based mutation, and the like, either segmentally or over the entire length of the individual selected library member sequence. The mutagenized selected library members are then shuffled by in vitro and/or in vivo recombinatorial shuffling as disclosed herein.

The invention also provides a product-by-process, wherein selected polynucleotide sequences having (or encoding a peptide having) a predetermined binding specificity are formed by the process of: (1) screening a displayed peptide or displayed single-chain antibody library against a predetermined receptor (e.g., ligand) or epitope (e.g., antigen macromolecule) and identifying and/or enriching library members which bind to the predetermined receptor or epitope to produce a pool of selected library members, (2) shuffling by recombination the selected library members (or amplified or cloned copies thereof) which bind the predetermined epitope and have been thereby isolated and/or enriched from the library to generate a shuffled library, and (3) screening the shuffled library against the predetermined receptor (e.g., ligand) or epitope (e.g., antigen macromolecule) and identifying and/or enriching shuffled library members which bind to the predetermined receptor or epitope to produce a pool of selected shuffled library members.

In one embodiment, the present invention provides a method for generating libraries of displayed polypeptides or displayed antibodies suitable for affinity interaction screening or phenotypic screening. The method comprises (1) obtaining a first plurality of selected library members comprising a displayed polypeptide or displayed antibody and an associated polynucleotide encoding said displayed polypeptide or displayed antibody, and obtaining said associated polynucleotides or copies thereof wherein said associated polynucleotides comprise a region of substantially identical sequences, optionally introducing mutations into said polynucleotides or copies, (2) pooling the polynucleotides or copies, (3) producing smaller or shorter polynucleotides by interrupting a random or particularized priming and synthesis process or an amplification process, and (4) performing amplification, preferably PCR amplification, and optionally mutagenesis to homologously recombine the newly synthesized polynucleotides.

It is a particularly preferred object of the invention to provide a process for producing hybrid polynucleotides which express a useful hybrid polypeptide by a series of steps comprising:

-   (a) producing polynucleotides by interrupting a polynucleotide amplification or synthesis process with a means for blocking or interrupting the amplification or synthesis process and thus providing a plurality of smaller or shorter polynucleotides due to the replication of the polynucleotide being in various stages of completion;
-   (b) adding to the resultant population of single- or double-stranded polynucleotides one or more single- or double-stranded oligonucleotides, wherein said added oligonucleotides comprise an area of identity in an area of heterology to one or more of the single- or double-stranded polynucleotides of the population;
-   (c) denaturing the resulting single- or double-stranded oligonucleotides to produce a mixture of single-stranded polynucleotides, optionally separating the shorter or smaller polynucleotides into pools of polynucleotides having various lengths and further optionally subjecting said polynucleotides to a PCR procedure to amplify one or more oligonucleotides comprised by at least one of said polynucleotide pools;
-   (d) incubating a plurality of said polynucleotides or at least one pool of said polynucleotides with a polymerase under conditions which result in annealing of said single-stranded polynucleotides at regions of identity between the single-stranded polynucleotides and thus forming of a mutagenized double-stranded polynucleotide chain;
-   (e) optionally repeating steps (c) and (d);
-   (f) expressing at least one hybrid polypeptide from said polynucleotide chain, or chains; and
-   (g) screening said at least one hybrid polypeptide for a useful activity.

In a preferred aspect of the invention, the means for blocking or interrupting the amplification or synthesis process is by utilization of UV light, DNA adducts, or DNA binding proteins.

In one embodiment of the invention, the DNA adducts, or polynucleotides comprising the DNA adducts, are removed from the polynucleotides or polynucleotide pool, such as by a process including heating the solution comprising the DNA fragments prior to further processing.

Having thus disclosed exemplary embodiments of the present invention, it should be noted by those skilled in the art that the disclosures are exemplary only and that various other alternatives, adaptations and modifications may be made within the scope of the present invention. Accordingly, the present invention is not limited to the specific embodiments as illustrated herein.

5.3.1. Antibody Production

High affinity single-chain antibodies of the desired specificity can be engineered and expressed in a variety of systems. For example, scFv have been produced in plants (Firek et al., 1993) and can be readily made in prokaryotic systems (Owens and Young, 1994; Johnson and Bird, 1991). Furthermore, the single-chain antibodies can be used as a basis for constructing whole antibodies or various fragments thereof (Kettleborough et al., 1994). The variable region encoding sequence may be isolated (e.g., by PCR amplification or subcloning) and spliced to a sequence encoding a desired human constant region to encode a human sequence antibody more suitable for human therapeutic uses where immunogenicity is preferably minimized. The polynucleotide(s) having the resultant fully human encoding sequence(s) can be expressed in a host cell (e.g., from an expression vector in a mammalian cell) and purified for pharmaceutical formulation. The antibodies generated by the method of the present invention can be used for diagnosis and therapy. By way of illustration and not limitation, they can be used to treat cancer, autoimmune diseases, or viral infections. For treatment of cancer, the antibodies will typically bind to an antigen expressed preferentially on cancer cells, such as erbB-2, CEA, CD33, and many other antigens and binding members well known to those skilled in the art.

5.3.1.1. Modified Variable Regions

Beginning in 1988, single-chain analogues of Fv fragments and their fusion proteins have been reliably generated by antibody engineering methods. The first step generally involves obtaining the genes encoding VH and VL domains with desired binding properties; these V genes may be isolated from a specific hybridoma cell line, selected from a combinatorial V-gene library, or made by V gene synthesis. The single-chain Fv is formed by connecting the component V genes with an oligonucleotide that encodes an appropriately designed linker peptide, such as (Gly-Gly-Gly-Gly-Ser)3 or equivalent linker peptide(s). The linker bridges the C-terminus of the first V region and N-terminus of the second, ordered as either VH-linker-VL or VL-linker-VH. In principle, the scFv binding site can faithfully replicate both the affinity and specificity of its parent antibody combining site.

Thus, scFv fragments are comprised of VH and VL domains linked into a single polypeptide chain by a flexible linker peptide. After the scFv genes are assembled, they are cloned into a phagemid and expressed at the tip of the M13 phage (or similar filamentous bacteriophage) as fusion proteins with the bacteriophage PIII (gene 3) coat protein. Enriching for phage expressing an antibody of interest is accomplished by panning the recombinant phage displaying a population of scFv for binding to a predetermined epitope (e.g., target antigen, receptor).

The linked polynucleotide of a library member provides the basis for replication of the library member after a screening or selection procedure, and also provides the basis for the determination, by nucleotide sequencing, of the identity of the displayed peptide sequence or VH and VL amino acid sequence. The displayed peptide(s) or single-chain antibody (e.g., scFv) and/or its VH and VL domains or their CDRs can be cloned and expressed in a suitable expression system. Often polynucleotides encoding the isolated VH and VL domains will be ligated to polynucleotides encoding constant regions (CH and CL) to form polynucleotides encoding complete antibodies (e.g., chimeric or fully-human), antibody fragments, and the like. Often polynucleotides encoding the isolated CDRs will be grafted into polynucleotides encoding a suitable variable region framework (and optionally constant regions) to form polynucleotides encoding complete antibodies (e.g., humanized or fully-human), antibody fragments, and the like. Antibodies can be used to isolate preparative quantities of the antigen by immunoaffinity chromatography. Various other uses of such antibodies are to diagnose and/or stage disease (e.g., neoplasia) and for therapeutic application to treat disease, such as for example: neoplasia, autoimmune disease, AIDS, cardiovascular disease, infections, and the like.

If it were possible to generate scFv libraries having broader antibody diversity and overcoming many of the limitations of conventional CDR mutagenesis and randomization methods, which can cover only a very tiny fraction of the potential sequence combinations, the number and quality of scFv antibodies suitable for therapeutic and diagnostic use could be vastly improved. To address this, the in vitro and in vivo shuffling methods of the invention are used to recombine CDRs which have been obtained (typically via PCR amplification or cloning) from nucleic acids obtained from selected displayed antibodies. Such displayed antibodies can be displayed on cells, on bacteriophage particles, on polysomes, or any suitable antibody display system wherein the antibody is associated with its encoding nucleic acid(s). In a variation, the CDRs are initially obtained from mRNA (or cDNA) from antibody-producing cells (e.g., plasma cells/splenocytes from an immunized wild-type mouse, a human, or a transgenic mouse capable of making a human antibody as in WO 92/03918, WO 93/12227, and WO 94/25585), including hybridomas derived therefrom.

Polynucleotide sequences selected in a first selection round (typically by affinity selection for displayed antibody binding to an antigen (e.g., a ligand)) by any of these methods are pooled and the pool(s) is/are shuffled by in vitro and/or in vivo recombination, especially shuffling of CDRs (typically shuffling heavy chain CDRs with other heavy chain CDRs and light chain CDRs with other light chain CDRs) to produce a shuffled pool comprising a population of recombined selected polynucleotide sequences. The recombined selected polynucleotide sequences are expressed in a selection format as a displayed antibody and subjected to at least one subsequent selection round. The polynucleotide sequences selected in the subsequent selection round(s) can be used directly, sequenced, and/or subjected to one or more additional rounds of shuffling and subsequent selection until an antibody of the desired binding affinity is obtained. Selected sequences can also be back-crossed with polynucleotide sequences encoding neutral antibody framework sequences (i.e., having insubstantial functional effect on antigen binding), such as for example by back-crossing with a human variable region framework to produce human-like sequence antibodies. Generally, during back-crossing subsequent selection is applied to retain the property of binding to the predetermined antigen.
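By way of illustration and not limitation, the chain-restricted CDR shuffling described above can be sketched in Python as a toy in silico model. The slot names (H1 to H3, L1 to L3), the function `shuffle_cdrs`, and the placeholder CDR labels are hypothetical and are introduced only for this sketch; the model draws each CDR position independently from the pool of selected clones and does not represent the molecular recombination procedure itself.

```python
import random

# Hypothetical slot names: three heavy chain and three light chain CDRs.
CDR_SLOTS = ("H1", "H2", "H3", "L1", "L2", "L3")


def shuffle_cdrs(selected, n_progeny, rng):
    """Build recombinant clones by drawing each CDR slot from the same slot
    of the selected clones, so heavy chain CDRs only recombine with heavy
    chain CDRs of the same position, and likewise for light chain CDRs."""
    pools = {slot: [clone[slot] for clone in selected] for slot in CDR_SLOTS}
    return [
        {slot: rng.choice(pools[slot]) for slot in CDR_SLOTS}
        for _ in range(n_progeny)
    ]


# Placeholder labels standing in for CDR sequences of three selected clones.
selected = [
    {"H1": "h1a", "H2": "h2a", "H3": "h3a", "L1": "l1a", "L2": "l2a", "L3": "l3a"},
    {"H1": "h1b", "H2": "h2b", "H3": "h3b", "L1": "l1b", "L2": "l2b", "L3": "l3b"},
    {"H1": "h1c", "H2": "h2c", "H3": "h3c", "L1": "l1c", "L2": "l2c", "L3": "l3c"},
]
progeny = shuffle_cdrs(selected, 5, random.Random(0))
```

In such a model, each round of selection would narrow `selected` and each call to `shuffle_cdrs` would regenerate diversity from the survivors, mirroring the alternation of selection and shuffling rounds.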

Alternatively, or in combination with the noted variations, the valency of the target epitope may be varied to control the average binding affinity of selected scFv library members. The target epitope can be bound to a surface or substrate at varying densities, such as by including a competitor epitope, by dilution, or by other methods known to those in the art. A high density (valency) of predetermined epitope can be used to enrich for scFv library members which have relatively low affinity, whereas a low density (valency) can preferentially enrich for higher affinity scFv library members.

For generating diverse variable segments, a collection of synthetic oligonucleotides encoding random, pseudorandom, or a defined sequence kernel set of peptide sequences can be inserted by ligation into a predetermined site (e.g., a CDR). Similarly, the sequence diversity of one or more CDRs of the single-chain antibody cassette(s) can be expanded by mutating the CDR(s) with site-directed mutagenesis, CDR-replacement, and the like. The resultant DNA molecules can be propagated in a host for cloning and amplification prior to shuffling, or can be used directly (i.e., may avoid loss of diversity which may occur upon propagation in a host cell) and the selected library members subsequently shuffled.

A variety of techniques can be used in the present invention to diversify a peptide library or single-chain antibody library, or to diversify, prior to or concomitant with shuffling, around variable segment peptides found in early rounds of panning to have sufficient binding activity to the predetermined macromolecule or epitope. In one approach, the positive selected peptide/polynucleotide complexes (those identified in an early round of affinity enrichment) are sequenced to determine the identity of the active peptides. Oligonucleotides are then synthesized based on these active peptide sequences, employing a low level of all bases incorporated at each step to produce slight variations of the primary oligonucleotide sequences. This mixture of (slightly) degenerate oligonucleotides is then cloned into the variable segment sequences at the appropriate locations. This method produces systematic, controlled variations of the starting peptide sequences, which can then be shuffled. It requires, however, that individual positive nascent peptide/polynucleotide complexes be sequenced before mutagenesis, and thus is useful for expanding the diversity of small numbers of recovered complexes and selecting variants having higher binding affinity and/or higher binding specificity. In a variation, mutagenic PCR amplification of positive selected peptide/polynucleotide complexes (especially of the variable region sequences) is performed, the amplification products are shuffled in vitro and/or in vivo, and one or more additional rounds of screening are done prior to sequencing. The same general approach can be employed with single-chain antibodies in order to expand the diversity and enhance the binding affinity/specificity, typically by diversifying CDRs or adjacent framework regions prior to or concomitant with shuffling. If desired, shuffling reactions can be spiked with mutagenic oligonucleotides capable of in vitro recombination with the selected library members. Thus, mixtures of synthetic oligonucleotides and PCR-produced polynucleotides (synthesized by error-prone or high-fidelity methods) can be added to the in vitro shuffling mix and be incorporated into resulting shuffled library members (shufflants).

5.3.1.2. Modified CDR Regions

The present invention of shuffling enables the generation of a vast library of CDR-variant single-chain antibodies. One way to generate such antibodies is to insert synthetic CDRs into the single-chain antibody and/or CDR randomization prior to or concomitant with shuffling. The sequences of the synthetic CDR cassettes are selected by referring to known sequence data of human CDR and are selected in the discretion of the practitioner according to the following guidelines: synthetic CDRs will have at least 40 percent positional sequence identity to known CDR sequences, and preferably will have at least 50 to 70 percent positional sequence identity to known CDR sequences. For example, a collection of synthetic CDR sequences can be generated by synthesizing a collection of oligonucleotide sequences on the basis of naturally-occurring human CDR sequences listed in Kabat (Kabat et al., 1991); the pool(s) of synthetic CDR sequences are calculated to encode CDR peptide sequences having at least 40 percent sequence identity to at least one known naturally-occurring human CDR sequence. Alternatively, a collection of naturally-occurring CDR sequences may be compared to generate consensus sequences so that amino acids used at a residue position frequently (i.e., in at least 5 percent of known CDR sequences) are incorporated into the synthetic CDRs at the corresponding position(s). Typically, several (e.g., 3 to about 50) known CDR sequences are compared and observed natural sequence variations between the known CDRs are tabulated, and a collection of oligonucleotides encoding CDR peptide sequences encompassing all or most permutations of the observed natural sequence variations is synthesized. For example but not for limitation, if a collection of human VH CDR sequences have carboxy-terminal amino acids which are either Tyr, Val, Phe, or Asp, then the pool(s) of synthetic CDR oligonucleotide sequences are designed to allow the carboxy-terminal CDR residue to be any of these amino acids. In some embodiments, residues other than those which naturally-occur at a residue position in the collection of CDR sequences are incorporated: conservative amino acid substitutions are frequently incorporated and up to 5 residue positions may be varied to incorporate non-conservative amino acid substitutions as compared to known naturally-occurring CDR sequences. Such CDR sequences can be used in primary library members (prior to first round screening) and/or can be used to spike in vitro shuffling reactions of selected library member sequences. Construction of such pools of defined and/or degenerate sequences will be readily accomplished by those of ordinary skill in the art.
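For illustration only, the guidelines above (tabulating observed per-position variation, applying the 5 percent frequency cutoff, enumerating permutations, and checking the 40 percent positional identity floor) might be sketched in Python at the peptide level as follows. The function names and the four six-residue toy sequences are hypothetical, and the sketch assumes the known CDRs are already aligned and of equal length; it is not the procedure of the invention, which operates on synthesized oligonucleotides.

```python
from collections import Counter
from itertools import product


def positional_identity(a, b):
    """Fraction of aligned positions at which two equal-length sequences match."""
    if len(a) != len(b):
        raise ValueError("sequences must be the same length")
    return sum(x == y for x, y in zip(a, b)) / len(a)


def observed_residues(known_cdrs, min_fraction=0.05):
    """Per position, residues seen in at least min_fraction of the known CDRs."""
    n = len(known_cdrs)
    return [
        sorted(res for res, count in Counter(col).items() if count / n >= min_fraction)
        for col in zip(*known_cdrs)  # assumes pre-aligned, equal-length CDRs
    ]


def synthetic_pool(known_cdrs, min_identity=0.40):
    """Enumerate all permutations of the observed per-position residues and keep
    those with at least min_identity positional identity to some known CDR."""
    candidates = ("".join(p) for p in product(*observed_residues(known_cdrs)))
    return [
        s for s in candidates
        if max(positional_identity(s, k) for k in known_cdrs) >= min_identity
    ]


# Hypothetical toy CDRs whose carboxy-terminal residues are Tyr, Val, Phe, Asp.
known = ["GFTFSY", "GYTFTV", "GFSFSF", "GYTFSD"]
pool = synthetic_pool(known)
novel = [s for s in pool if s not in known]
```

With these toy inputs the pool contains every combination of the residues observed at each position, so its carboxy-terminal residue can be any of Y, V, F, or D, and most members are not found among the known sequences.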

The collection of synthetic CDR sequences comprises at least one member that is not known to be a naturally-occurring CDR sequence. It is within the discretion of the practitioner to include or not include a portion of random or pseudorandom sequence corresponding to N region addition in the heavy chain CDR; the N region sequence ranges from 1 nucleotide to about 4 nucleotides occurring at V-D and D-J junctions. A collection of synthetic heavy chain CDR sequences comprises at least about 100 unique CDR sequences, typically at least about 1,000 unique CDR sequences, preferably at least about 10,000 unique CDR sequences, frequently more than 50,000 unique CDR sequences; however, usually not more than about 1×10⁶ unique CDR sequences are included in the collection, although occasionally 1×10⁷ to 1×10⁸ unique CDR sequences are present, especially if conservative amino acid substitutions are permitted at positions where the conservative amino acid substituent is not present or is rare (i.e., less than 0.1 percent) in that position in naturally-occurring human CDRs. In general, the number of unique CDR sequences included in a library should not exceed the expected number of primary transformants in the library by more than a factor of 10. Such single-chain antibodies generally bind with an affinity of about at least 1×10⁷ M⁻¹, preferably with an affinity of about at least 5×10⁷ M⁻¹, more preferably with an affinity of at least 1×10⁸ M⁻¹ to 1×10⁹ M⁻¹ or more, sometimes up to 1×10¹⁰ M⁻¹ or more. Frequently, the predetermined antigen is a human protein, such as for example a human cell surface antigen (e.g., CD4, CD8, IL-2 receptor, EGF receptor, PDGF receptor), other human biological macromolecule (e.g., thrombomodulin, protein C, carbohydrate antigen, sialyl Lewis antigen, L-selectin), or nonhuman disease associated macromolecule (e.g., bacterial LPS, virion capsid protein or envelope glycoprotein) and the like.
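As a worked illustration of the sizing guideline above (unique CDR sequences not exceeding the expected number of primary transformants by more than a factor of 10), the following minimal Python sketch computes the bound. The function names and the example figure of 100,000 primary transformants are hypothetical.

```python
def max_unique_cdrs(expected_primary_transformants, factor=10):
    """Upper bound on unique CDR sequences under the factor-of-10 guideline."""
    return factor * expected_primary_transformants


def within_guideline(unique_cdrs, expected_primary_transformants):
    """True if the collection size does not exceed the guideline bound."""
    return unique_cdrs <= max_unique_cdrs(expected_primary_transformants)


# Hypothetical example: a library expected to yield 100,000 primary transformants.
limit = max_unique_cdrs(100_000)
```

Under this assumption, a collection of 50,000 unique CDR sequences falls within the guideline, whereas a collection of 1×10⁷ unique sequences would exceed it.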

5.4 Increased Expression in a Recombinant Host

In one embodiment, this invention provides for increasing expression of a gene or trait of interest in a recombinant host of interest. The hosts can include but are not limited to bacteria, fungi, protozoans, viruses, animals, insects, and plants.

5.5 Metabolite Shifting

In one embodiment, this invention provides for metabolite shifting.

5.6 Creating a Modified Plant with Desired Traits

One aspect of the present invention relates to the use of trait DNA molecules which are heterologous to the plant, e.g., DNA molecules that confer disease resistance to plants transformed with the DNA construct. The present invention is useful in plants for imparting resistance to a wide variety of pathogens including viruses, bacteria, fungi, viroids, phytoplasmas, nematodes, and insects. The present invention may also be used in mammals to impart genetic traits. Resistance, inter alia, to the following viruses can be achieved by the method of the present invention: tomato spotted wilt virus, impatiens necrotic spot virus, groundnut ringspot virus, potato virus Y, potato virus X, tobacco mosaic virus, turnip mosaic virus, tobacco etch virus, papaya ringspot virus, tomato mottle virus, tomato yellow leaf curl virus, or combinations thereof. Resistance, inter alia, to the following bacteria can also be imparted to plants in accordance with the present invention: Pseudomonas solanacearum, Pseudomonas syringae pv. tabaci, Xanthomonas campestris pv. pelargonii, and Agrobacterium tumefaciens. Plants can be made resistant, inter alia, to the following fungi by use of the method of the present invention: Fusarium oxysporum and Phytophthora infestans. Suitable DNA molecules include a DNA molecule encoding a coat protein, a replicase, a DNA molecule not encoding protein, a DNA molecule encoding a viral gene product, or combinations thereof.

The present invention is also used to confer traits other than disease resistance on plants. For example, DNA molecules which impart a plant genetic trait can be used as the DNA trait molecule of the present invention. In this aspect of the present invention, suitable trait DNA molecules encode for desired color, enzyme production, or combinations thereof. In another embodiment, this invention provides for engineering plants with desired traits, including output (e.g., producing increased amounts of a desired vitamin, mineral, or genetically engineered and introduced molecule such as an antibody) and input (e.g., drought and/or salinity resistance) traits.

5.7 Plant Gene Expression

This invention is related to the genetic engineering of plants and to a means and method (use of DNA construct) for conferring a plurality of traits, including resistance to viruses, to a plant using a vector encoding a plurality of genes, such as coat protein genes, protease genes, or replicase genes. The field of the invention is plant genetics, including genetic mapping and restriction fragment length polymorphism technology.

The present invention also relates to:

-   (i) the production of mature proteins in plant cells, including the production of proteins in mature secreted form; and
-   (ii) the development of techniques for the commercial production of transgenic plants.

5.7.1 General Considerations

The present invention provides a chimeric recombinant DNA molecule comprising: a plurality of DNA sequences, each of which comprises a plant-functional promoter linked to a coding region, which encodes a virus-associated coat protein, wherein said DNA sequences are preferably linked in-tandem so that they are expressed in virus-susceptible plant cells transformed with said recombinant DNA molecule to impart resistance to said viruses; as well as methods for transforming plants with the chimeric constructs and for selecting plants which express at least one of said DNA sequences imparting viral resistance.

Also provided are methods of making a genetically modified plant comprising regenerating a whole plant from a plant cell that has been transfected with DNA sequences comprising a first gene whose expression results in an altered plant phenotype linked to a transiently active promoter, the gene and promoter being separated by a blocking sequence flanked on either side by specific excision sequences, a second gene that encodes a recombinase specific for the specific excision sequences linked to a repressible promoter, and a third gene that encodes the repressor specific for the repressible promoter. Also provided is a method for making a genetically modified hybrid plant by hybridizing a first plant regenerated from a plant cell that has been transfected with DNA sequences comprising a first gene whose expression results in an altered plant phenotype linked to a transiently active promoter, the gene and promoter being separated by a blocking sequence flanked on either side by specific excision sequences, to a second plant regenerated from a second plant cell that has been transfected with DNA sequences comprising a second gene that encodes a recombinase specific for the specific excision sequences linked to a promoter that is active during seed germination, and growing a hybrid plant from the hybrid seed. Plant cells, plant tissues, plant seed and whole plants containing the above DNA sequences are also claimed.

The present invention is also directed to a DNA construct formed from a fusion gene which includes a trait DNA molecule and a silencer DNA molecule. The trait DNA molecule has a length that is insufficient to impart a desired trait to plants transformed with the trait DNA molecule. The silencer DNA molecule is operatively coupled to the trait DNA molecule with the trait and silencer DNA molecules collectively having sufficient length to impart the trait to plants transformed with the DNA construct. Expression systems, host cells, plants, and plant seeds containing the DNA construct are disclosed. The present invention is also directed to imparting multiple traits to a plant.

The present invention is also directed to methods of introgressing one or more desired quantitative traits into a plant comprising screening one or more restriction fragment length polymorphisms (RFLP) for association with desired quantitative traits (QT), selecting one or more RFLPs showing association with the desired QTs, developing a mathematical model based on the magnitude of the association of RFLP(s) to predict the degree of expression of the desired QTs, and using the thus-selected RFLP(s) and the mathematical model in a plant breeding program to predict the degree of introgression and expression of the desired QTs in plant progeny.

A method for producing one of the following proteins in transgenic monocot plant cells is disclosed: (i) mature, glycosylated α₁-antitrypsin (AAT) having the same N-terminal amino acid sequence as mature AAT produced in humans and a glycosylation pattern which increases serum half-life substantially over that of mature non-glycosylated AAT; (ii) mature, glycosylated antithrombin III (ATIII) having the same N-terminal amino acid sequence as mature ATIII produced in humans; (iii) mature human serum albumin (HSA) having the same N-terminal amino acid sequence as mature HSA produced in humans and having the folding pattern of native mature HSA as evidenced by its bilirubin-binding characteristics; and (iv) mature, active subtilisin BPN′ (BPN′) having the same N-terminal amino acid sequence as BPN′ produced in Bacillus. Monocot plant cells are transformed with a chimeric gene which includes a DNA coding sequence encoding a fusion protein having (i) an N-terminal moiety corresponding to a rice α-amylase signal sequence peptide and, (ii) immediately adjacent the C-terminal amino acid of said peptide, a protein moiety corresponding to the mature protein to be produced.

Also provided is a process for commercially propagating plants by tissue culture in such a way as both to conserve desired plant morphology and to transform the plant with respect to one or more desired genes. The method includes the steps of (a) creating an Agrobacterium vector containing the gene sequence desired to be transferred to the propagated plant, preferably together with a marker gene; (b) taking one or more petiole explants from a mother plant and inoculating them with the Agrobacterium vector; (c) conducting callus formation in the petiole sections in culture, in the dark; and (d) culturing the resulting callus in growth medium containing a benzylamino growth regulator such as benzylaminopurine or, most preferably, benzylaminopurine-riboside. Additional optional growth regulators including auxins and cytokinins (indole butyric acid, benzylamine, benzyladenine, benzylaminopurine, alpha naphthylacetic acid and others known in the art) may also be present. Preferably, the petiole tissue is taken from Pelargonium x domesticum and the Agrobacterium vector contains an antisense gene for ACC synthase or ACC oxidase to prevent ACC synthase or ACC oxidase expression and, in turn, the ethylene formation for which these enzymes are precursors.

5.7.2. Production of Virus Resistant Plants

5.7.2.1 Production of Virus Resistant Plants

Scientists have recently developed means to produce virus resistant plants using genetic engineering techniques. Several different types of host resistance to viruses are recognized. The host may be resistant to: (1) establishment of infection, (2) virus multiplication, or (3) viral movement. One potential application would be to engineer a plant that is resistant to potyviruses. Potyviruses are a distinct group of plant viruses which are pathogenic to various crops, and which demonstrate cross-infectivity between plant members of different families. For example, expression of the coat protein genes from tobacco mosaic virus, alfalfa mosaic virus, cucumber mosaic virus, and potato virus X, among others, in transgenic plants has resulted in plants which are resistant to infection by the respective virus. Some evidence of heterologous protection has also been reported. For example, see Namba et al., Phytopathology, 82, 940 (1992) and Stark et al., Biotechnology, 1, 1257 (1989).

5.7.2.2 Using “Pathogen-Derived Resistance” (PDR) for Developing Virus Resistant Transgenic Plants

Control of plant virus diseases took a major step forward in the last decade when it was shown in 1986 that the tobacco mosaic virus (“TMV”) coat protein gene that was expressed in transgenic tobacco conferred resistance to TMV (Powell-Abel, P., et al., “Delay of Disease Development in Transgenic Plants that Express the Tobacco Mosaic Virus Coat Protein Gene,” Science, 232: 738-43 (1986)). The concept of pathogen-derived resistance (“PDR”), which states that pathogen genes that are expressed in transgenic plants will confer resistance to infection by the homologous or related pathogens (Sanford, J. C., et al., “The Concept of Parasite-Derived Resistance: Deriving Resistance Genes from the Parasite's Own Genome,” J. Theor. Biol., 113: 395-405 (1985)), was introduced at about the same time. Since then, numerous reports have confirmed that PDR is a useful strategy for developing transgenic plants that are resistant to many different viruses (Lomonossoff, G. P., “Pathogen-Derived Resistance to Plant Viruses,” Ann. Rev. Phytopathol., 33: 323-43 (1995)).

Only eight years after the report by Beachy and colleagues (Powell-Abel, P., et al., “Delay of Disease Development in Transgenic Plants that Express the Tobacco Mosaic Virus Coat Protein Gene,” Science, 232: 738-43 (1986)), Grumet, R., “Development of Virus Resistant Plants via Genetic Engineering,” Plant Breeding Reviews, 12: 47-49 (1994) reviewed the PDR literature and listed the successful development of virus resistant transgenic plants to at least 11 different groups of plant viruses.

5.7.2.2.1 Utilizing The Coat Protein Genes

The vast majority of reports have utilized the coat protein genes of the viruses that are targeted for control (e.g., Grumet, R., “Development of Virus Resistant Plants via Genetic Engineering,” Plant Breeding Reviews, 12: 47-49 (1994)). Additional examples are included in the following references: Fuchs, M., et al., Bio/Technology, 13: 1466-73 (1995); Tricoli, D. M., et al., Bio/Technology, 13: 1458-65 (1995); Fitch, M. M. M., et al., Bio/Technology, 1466-72 (1992); Tennant, P. F., et al., Phytopathology, 84: 1359-66 (1994).

5.7.2.2.2 Other Effective Viral Genes

Interestingly, remarkable progress has been made in developing virus resistant transgenic plants despite a poor understanding of the mechanisms involved in the various forms of pathogen-derived resistance (Lomonossoff, G. P., “Pathogen-Derived Resistance to Plant Viruses,” Ann. Rev. Phytopathol., 33: 323-43 (1995)). Various reports have utilized viral genes other than the coat protein genes to confer resistance (Golemboski, D. B., et al., Proc. Natl. Acad. Sci. USA, 87: 6311-15 (1990); Beck, D. L., et al., Proc. Natl. Acad. Sci. USA, 91: 10310-14 (1994); Maiti, I. B., et al., Proc. Natl. Acad. Sci. USA, 90: 6110-14 (1993)). Furthermore, the viral genes can be effective in the translatable and nontranslatable sense forms, and less frequently antisense forms (e.g., Baulcombe, D. C., “Mechanisms of Pathogen-Derived Resistance to Viruses in Transgenic Plants,” Plant Cell, 8: 1833-44 (1996); Dougherty, W. G., et al., “Transgenes and Gene Suppression: Telling us Something New?,” Current Opinion in Cell Biology, 7: 399-05 (1995); Lomonossoff, G. P., “Pathogen-Derived Resistance to Plant Viruses,” Ann. Rev. Phytopathol., 33: 323-43 (1995)).

5.7.2.2.3. RNA-Mediated Resistance

5.7.2.2.3.1 Description (A Form of PDR)

RNA-mediated resistance is the form of PDR where there is clear evidence that viral proteins do not play a role in conferring resistance to the transgenic plant. The first clear cases for RNA-mediated resistance were reported in 1992 for tobacco etch (“TEV”) potyvirus (Lindbo, et al., “Pathogen-Derived Resistance to a Potyvirus Immune and Resistance Phenotypes in Transgenic Tobacco Expressing Altered Forms of a Potyvirus Coat Protein Nucleotide Sequence,” Mol. Plant Microbe Interact., 5: 144-53 (1992)), for potato virus Y (“PVY”) potyvirus by Van Der Vlugt, R. A. A., et al., “Evidence for Sense RNA-Mediated Protection to PVY in Tobacco Plants Transformed with the Viral Coat Protein Cistron,” Plant Mol. Biol., 20: 631-39 (1992), and for tomato spotted wilt (“TSWV”) tospovirus by de Haan, P., et al., “Characterization of RNA-Mediated Resistance to Tomato Spotted Wilt Virus in Transgenic Tobacco Plants,” Bio/Technology 10: 1133-37 (1992). Others confirmed the occurrence of RNA-mediated resistance with potyviruses (Smith, H. A., et al., “Transgenic Plant Virus Resistance Mediated by Untranslatable Sense RNAs: Expression, Regulation, and Fate of Nonessential RNAs,” Plant Cell, 6: 1441-53 (1994)), potexviruses (Mueller, E., et al., “Homology-Dependent Resistance: Transgenic Virus Resistance in Plants Related to Homology-Dependent Gene Silencing,” Plant Journal, 7: 1001-13 (1995)), and TSWV and other tospoviruses (Pang, S. Z., et al., “Resistance of Transgenic Nicotiana Benthamiana Plants to Tomato Spotted Wilt and Impatiens Necrotic Spot Tospoviruses: Evidence of Involvement of the N Protein and N Gene RNA in Resistance,” Phytopathology, 84: 243-49 (1994); Pang, S.-Z., et al., “Different Mechanisms Protect Transgenic Tobacco Against Tomato Spotted Wilt Virus and Impatiens Necrotic Spot Tospoviruses,” Bio/Technology 11: 819-24 (1993)). More recent work has shown that RNA-mediated resistance also occurs with the comovirus cowpea mosaic virus (Sijen, T., et al., “RNA-Mediated Virus Resistance: Role of Repeated Transgene and Delineation of Targeted Regions,” Plant Cell, 8: 2227-94 (1996)) and squash mosaic virus (Jan, F.-J., et al., “Genetic and Molecular Analysis of Squash Plants Transformed with Coat Protein Genes of Squash Mosaic Virus,” Phytopathology, 86: S16-17 (1996)).

5.7.3 Creating Transgenic Plants with Controllable Genes

This invention also relates to certain transgenic plants and involves a method of creating transgenic plants with controllable genes. More particularly, the invention relates to transgenic plants that have been modified such that expression of a desired introduced gene can be limited to a particular stage of plant development, a particular plant tissue, particular environmental conditions, or a particular time or location, or a combination of these situations.

5.7.3.1 Inducible Gene Promoter: “Gene Switch”

Various gene expression control elements that are operable in one or more species of organisms are known. Examples are mentioned in PCT Application WO 90/08826 and PCT Application WO 94/03619. A tetracycline-controlled plant-active repressor-operator system can be utilized as described in various references: Gatz and Quail (1988); Gatz, et al. (1992); and (Hoppe-Seyler), 372: 659-660 (1991).

5.7.4 Recombinant Production of Proteins

A major commercial focus of biotechnology is the recombinant production of proteins, including both industrial enzymes and proteins that have important therapeutic uses.

5.7.4.1. Alternative Protein Expression System to Overcome Problems of Microbial and Mammalian Systems

It would therefore be desirable to produce selected therapeutic and industrial proteins in a protein expression system that largely overcomes problems associated with microbial and mammalian-cell systems. In particular, production of the proteins should allow large volume production at low cost, and yield properly processed and glycosylated proteins. The production system should also have a relatively stable genotype from generation to generation. These aims are achieved, in the present invention, for the therapeutic proteins AAT, HSA, and antithrombin III (ATIII), and the industrial enzyme subtilisin BPN′.

5.7.4.2. Uses

Various proteins of interest could be produced, such as but not limited to:

-   1) Human α₁-antitrypsin (AAT; Carrell, P., et al., Nature (1992) 298: 329; involved in cirrhosis and liver failure: e.g., Wu, Y., et al., BioEssays 13(4): 163 (1991)),
-   2) Human Antithrombin III (ATIII) (potentially useful in the prevention of thrombosis and pulmonary embolism),
-   3) Human Serum Albumin (Geisow, M. J. et al. (1977) Biochem. J. 163: 477-484; HSA is used to expand blood volume and raise low blood protein levels in cases of shock, trauma, and post-surgical recovery. HSA is often administered in emergency situations to stabilize blood pressure).
-   4) Subtilisin BPN′ is an important industrial enzyme, particularly for use as a detergent enzyme.

5.7.5 Practical Method for the Commercial Production of Transgenic Plants

Translating genetic engineering theory into practice, however, and then furthermore into a commercially practical reality, requires ingenuity. Gene transplantation in plants has already been accomplished at this writing (examples are cited below), but heretofore no practical method for the commercial production of transgenic plants has been perfected. Genetic engineering of plants may involve any or all of the following steps: tissue culture propagation, gene transplantation (e.g., with Agrobacterium and T-DNA), and the binary system (using binary vectors).

A general reference is Buchanan, B. B. et al., Biochemistry and Molecular Biology of Plants, ASPP Publications, 2001. Exemplary publications and patents which disclose transgenic plants and various techniques therefor are summarized below. Pellegrineschi, A., et al., Bio/Technology, Vol. 12 (January, 1994) discloses transformation of root cultures by inoculating stem and leaf fragments with Agrobacterium rhizogenes. An important plasmid in this species of Agrobacterium is the root-inducing plasmid, which can be used to transfer to the plant genome the genes necessary for improved root growth in culture. The use of sterilized petioles as the source of explant material for plant transformation and culture is disclosed. U.S. Pat. No. 5,276,268 to Strauch et al., entitled “Phosphinothricin-Resistance Gene, and Its Use,” is directed to the transfer of a phosphinothricin-resistance gene into plants using Agrobacterium species. A modification of the binary vector method is discussed, and the phosphinothricin-resistance gene nucleic acid sequences are provided. See also U.S. Pat. No. 5,283,184 and U.S. Pat. No. 5,286,635.

5.7.6 Method of Identifying and Characterizing the Role of Individual Plant Genes in Quantitative Trait Expression

One area in which biotechnology may have a significant impact on plant improvement is in the development of new methods to identify and characterize the role of individual plant genes in quantitative trait expression. Following the development of a new class of plant molecular markers based on restriction fragment length polymorphisms, termed “RFLPs” (Helentjaris et al., Plant Mol. Bio. 5: 109-118 (1985)) (“Helentjaris et al. I”), the processes to identify such loci and discriminate gene effects have been invented and are described and claimed herein. This and all other publications noted herein are hereby incorporated by reference. This will undoubtedly benefit plant improvement, not only within the context of conventional breeding approaches, but also by providing a means for identifying appropriate loci for future cloning and direct gene transfer efforts.

5.7.7. Fusion Genes

The present invention is directed to a DNA construct formed from a fusion gene which includes a trait DNA molecule and a silencer DNA molecule. In an alternative embodiment of the present invention, the DNA construct can be a fusion gene comprising a plurality of trait DNA molecules, at least some of which have a length that is insufficient to impart that trait to plants transformed with that trait DNA molecule. However, the plurality of trait DNA molecules collectively have a length sufficient to impart their traits to plants transformed with the DNA construct and to effect post-transcriptional silencing of the fusion gene. Expression systems, host cells, plants, and plant seeds containing this embodiment of the DNA construct are disclosed.

The present invention also provides a recombinant chimeric DNA molecule comprising a plurality of DNA sequences, each of which comprises a promoter operably linked to a DNA sequence which encodes a virus-associated protein, such as a coat protein (cp), a protease, or a replicase, wherein said DNA sequences are expressed in virus-susceptible plant cells transformed with said recombinant DNA molecule to impart resistance to infection by each of said viruses. Preferably, the DNA sequences are linked in tandem, i.e., exist in head-to-tail orientation relative to one another. Also, preferably, substantially equal levels of resistance to infection by each of said viruses occur in plant cells transformed with said plurality of DNA sequences.

Preferably, each DNA sequence is also linked to a 3′ non-translated DNA sequence which functions in plant cells to cause the termination of transcription and the addition of polyadenylated ribonucleotides to the 3′ end of the transcribed mRNA sequences. Preferably, the virus is a plant-associated virus, such as a potyvirus.

Thus, the present DNA molecule can be employed as a chimeric recombinant “expression construct” or “expression cassette” to prepare transgenic plants that exhibit increased resistance to infection by at least two plant viruses, such as potyviruses. The present cassettes also preferably comprise at least one selectable marker gene or reporter gene which is stably integrated into the genome of the transformed plant cells in association with the viral genes. The selectable marker and/or reporter genes facilitate identification of transformed plant cells and plants. Preferably, the virus gene array is flanked by two or more selectable marker genes, reporter genes or a combination thereof.

Another aspect of the present invention is a method of preparing a virus-resistant plant, such as a dicot, comprising:

-   (a) transforming plant cells with a chimeric recombinant DNA molecule comprising a plurality of DNA sequences, each comprising a promoter functional in said plant cells, operably linked to a DNA sequence, which encodes a protein associated with a virus which is capable of infecting said plant;
-   (b) regenerating said plant cells to provide a differentiated plant; and
-   (c) identifying a transformed plant which expresses the DNA sequences so as to render the plant resistant to infection by said viruses, preferably at substantially equal levels of resistance to infection by each virus.

Yet another object of the present invention is to provide a method for providing resistance to infection by viruses in a susceptible Cucurbitaceae plant which comprises:

-   (a) transforming Cucurbitaceae plant cells with a DNA molecule encoding a plurality of proteins from viruses which are capable of infecting said Cucurbitaceae plant;
-   (b) regenerating said plant cells to provide a differentiated plant; and
-   (c) selecting a transformed Cucurbitaceae which expresses the virus proteins at levels sufficient to render the plant resistant to infection by said viruses.

It is a further object of the present invention to provide a multi-virus resistant transformed plant which contains stably-integrated DNA sequences encoding virus proteins.

5.7.8 Controlling Gene Expression with External Stimulus

The present invention involves, in one embodiment, the creation of a transgenic plant that contains a gene whose expression can be controlled by application of an external stimulus. This system achieves a positive control of gene expression by an external stimulus, without the need for continued application of the external stimulus to maintain gene expression. The present invention also involves, in a second embodiment, the creation of transgenic parental plants that are hybridized to produce a progeny plant expressing a gene not expressed in either parent. By controlling the expression of genes that affect the plant phenotype, it is possible to grow plants under one set of conditions or in one environment where one phenotype is advantageous, then either move the plant or plant its seed under another set of conditions or in another environment where a different phenotype is advantageous. This technique has particular utility in agricultural and horticultural applications.

In accordance with one embodiment of the invention, a series of sequences is introduced into a plant that includes a transiently-active promoter linked to a structural gene, the promoter and structural gene being separated by a blocking sequence that is in turn bounded on either side by specific excision sequences, a repressible promoter operably linked to a gene encoding a site-specific recombinase capable of recognizing the specific excision sequences, and a gene encoding a repressor specific for the repressible promoter whose function is sensitive to an external stimulus. Without application of the external stimulus, the structural gene is not expressed. Upon application of the stimulus, repressor function is inhibited, the recombinase is expressed and effects the removal of the blocking sequence at the specific excision sequences, thereby directly linking the structural gene and the transiently-active promoter.
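The control logic of this embodiment can be illustrated with a toy model. What follows is only a minimal sketch under stated assumptions, not a biological simulation: the element labels, the `Plant` class, and the `excise` helper are hypothetical names introduced for illustration, and the sketch assumes that excision leaves a single excision site behind between the promoter and the structural gene.

```python
from dataclasses import dataclass

# Hypothetical labels for elements of the introduced DNA (illustration only).
EXCISION = "excision_site"
BLOCKER = "blocking_sequence"


def excise(construct):
    """Drop everything between the outermost excision sites, keeping one site.

    Assumption of this sketch: one excision site remains after recombination.
    """
    if construct.count(EXCISION) < 2:
        return list(construct)  # nothing to excise
    first = construct.index(EXCISION)
    last = len(construct) - 1 - construct[::-1].index(EXCISION)
    return construct[: first + 1] + construct[last + 1 :]


@dataclass
class Plant:
    construct: list                  # ordered elements of the first DNA sequence
    stimulus_present: bool = False   # external stimulus currently applied?

    def recombinase_expressed(self):
        # The repressor blocks the repressible promoter unless the external
        # stimulus inhibits repressor function.
        return self.stimulus_present

    def apply_stimulus(self):
        self.stimulus_present = True
        if self.recombinase_expressed():
            self.construct = excise(self.construct)  # permanent change

    def withdraw_stimulus(self):
        # Excision is not reversed when the stimulus is withdrawn.
        self.stimulus_present = False

    def structural_gene_expressed(self, promoter_active):
        # Requires an active transiently-active promoter and no blocker.
        return promoter_active and BLOCKER not in self.construct


plant = Plant(["transient_promoter", EXCISION, BLOCKER, EXCISION, "structural_gene"])
before = plant.structural_gene_expressed(promoter_active=True)
plant.apply_stimulus()
plant.withdraw_stimulus()
after = plant.structural_gene_expressed(promoter_active=True)
print(before, after)
```

In this sketch the excision step changes the construct permanently, which mirrors the statement that continued application of the stimulus is not needed once the blocking sequence has been removed.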

In a modification of this embodiment, the sequences encoding the recombinase can be introduced separately into the plant via a viral vector.

In an alternative embodiment, no repressor gene or repressible promoter is used. Instead, the recombinase gene is linked to a germination-specific promoter and introduced into a separate plant from the other sequences. The plant containing the transiently-active promoter, blocking sequence, and structural gene is then hybridized with the plant containing the recombinase gene, producing progeny that contain all of the sequences. When the second transiently-active promoter becomes active, the recombinase removes the blocking sequence in the progeny, allowing expression of the structural gene in the progeny, whereas it was not expressed in either parent.

In still another embodiment, the recombinase gene is simply linked to an inducible promoter. Exposure of the plant to the inducer specific for the inducible promoter leads to the expression of the recombinase gene and the excision of the blocking sequence.

In all of these embodiments, the structural gene is expressed when the transiently-active promoter becomes active in the normal course of growth and development, and will continue to be expressed so long as the transiently-active promoter is active, without the necessity of continuous external stimulation. This system is particularly useful for developing seed, where a particular trait is only desired during the first generation of plants grown from that seed, or a trait is desired only in subsequent generations.

5.7.9. Preparing Plants which are Resistant to Multiple Viruses

It is still a further object of the present invention to provide virus resistant transformed plant cells which contain a plurality of viral genes, i.e., 2-7 or more genes, which are expressed as virus proteins, such as coat proteins, proteases and/or replicases, from the same virus strain, from different virus strains, or from different members of the virus group, such as the potyvirus group. Representative viruses from which these DNA sequences can be isolated include, but are not limited to, potato virus X (PVX), potyviruses such as potato virus Y (PVY), cucumovirus (CMV), tobacco vein mottling virus, watermelon mosaic virus (WMV), zucchini yellow mosaic virus (ZYMV), bean common mosaic virus, bean yellow mosaic virus, soybean mosaic virus, peanut mottle virus, beet mosaic virus, wheat streak mosaic virus, maize dwarf mosaic virus, sorghum mosaic virus, sugarcane mosaic virus, johnsongrass mosaic virus, plum pox virus, tobacco etch virus, sweet potato feathery mottle virus, yam mosaic virus, and papaya ringspot virus (PRV), cucumoviruses, including CMA and comovirus.

5.7.3.4.1 Using Short Fragments of Viral Genes to Impart Resistance

Rather than attempting to incorporate full length viral genes in a plant, the present invention uses short fragments of such genes to impart resistance to the plant against a plurality of viral pathogens. These short fragments, which each by themselves have insufficient length to impart such resistance, are more easily and cost effectively produced than full length genes. There is no need to include in the plant separate promoters for each of the fragments; only a single promoter is required. Moreover, such viral gene fragments can preferably be incorporated in a single expression system to produce transgenic plants with a single transformation event.

5.7.10 Imparting Other Traits to Plants

In addition to conferring on plants resistance to multiple viral diseases, the present invention can be utilized to impart other traits to plants. It is often desirable to incorporate a number of traits into a transgenic plant besides disease resistance. For example, color, enzyme production, etc. may be desirable traits to confer on a plant. However, transforming plants with a plurality of such traits encounters the same difficulties discussed above with respect to disease resistance. The present invention may be likewise useful in alleviating these problems with respect to traits other than disease resistance.

Thus, the present invention provides a genetic engineering methodology by which multiple traits can be manipulated and tracked as a single gene insert, i.e., as a construct which acts as a single gene which segregates as a single Mendelian locus. Although the invention is exemplified via virus resistance genes, in practice, any combination of genes could be linked. Therefore one could track a block of genes that provide traits such as disease resistance, plus enhanced herbicide resistance, plus extended shelf life, and the like, by simply tracking the linked selectable marker or reporter gene which has been incorporated into the transformation vector.

It was also discovered that when multiple tandem genes are inserted, they preferably all exhibit substantially the same degrees of efficacy, and more preferably substantially equal degrees of efficacy, wherein the term “substantial” as it relates to viral resistance is defined with reference to the assays described in the examples hereinbelow. For example, if one examines numerous transgenic lines containing an intact ZYMV and WMV-2 coat protein insert, one finds that if a line is immune to infection by ZYMV it is also immune to infection by WMV-2. Similarly, if a line exhibits a delay in symptom development to ZYMV it will also exhibit a delay in symptom development to WMV-2. Finally, if a line is susceptible to ZYMV it will be susceptible to WMV-2. This phenomenon is unexpected. If there were not a correlation between the efficacy of each gene in these multiple gene constructs, this approach as a tool in plant breeding would probably be prohibitively difficult to use. Even with single gene constructs, one must test numerous transgenic plant lines to find one that displays the appropriate level of efficacy. The probability of finding a line with useful levels of expression can range from 10-50% (depending on the species involved).

If the efficacy of individual genes in a Ti plasmid containing multiple genes were independent, the probability of finding a transgenic line that was resistant to each targeted virus would decrease dramatically. For example, if a species in which there is a 10% probability of identifying a line with resistance using a single gene insert is transformed with a triple-gene construct CZW, and each gene displays an independent level of efficacy, the probability of finding a line with resistance to CMV, ZYMV and WMV-2 would be 0.1×0.1×0.1=0.001 or 0.1%. However, since the efficacies of the genes in a multivalent construct are not independent of each other, the probability of finding a line with resistance to CMV, ZYMV and WMV-2 is still 10% rather than 0.1%. This advantage becomes more pronounced as constructs containing four or more genes are used.
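The arithmetic of this comparison can be sketched as follows. This is a minimal illustration only: the function names are hypothetical, the 10% single-gene figure is the example value used above, and the "expected lines to screen" figure assumes each line is an independent trial (so the expectation is the reciprocal of the per-line probability).

```python
def p_independent(p_single: float, n_genes: int) -> float:
    """Probability that one line is resistant to all viruses if each gene's
    efficacy were independent of the others."""
    return p_single ** n_genes


def p_correlated(p_single: float, n_genes: int) -> float:
    """Probability when the linked genes perform together, as observed:
    it stays at the single-gene value regardless of n_genes."""
    return p_single


def expected_lines_to_screen(p: float) -> float:
    """Expected number of lines screened to find one resistant line,
    assuming independent lines (mean of a geometric distribution)."""
    return 1.0 / p


for n in (1, 3, 4):
    pi = p_independent(0.1, n)
    pc = p_correlated(0.1, n)
    print(f"{n} gene(s): independent {pi:.4%}, correlated {pc:.4%}, "
          f"lines to screen {expected_lines_to_screen(pi):.0f} vs "
          f"{expected_lines_to_screen(pc):.0f}")
```

Under the independence assumption the screening burden grows by a factor of ten with each added gene in this example, whereas with correlated efficacy it stays constant.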

5.7.10.1 Production of a Mature Heterologous Protein in Transformed Monocot Plant Cells

In one aspect, the invention includes a method of producing, in monocot plant cells, a mature heterologous protein of interest. The method includes obtaining monocot cells transformed with a chimeric gene having (i) a monocot transcriptional regulatory region, inducible by addition or removal of a small molecule, or during seed maturation, (ii) a first DNA sequence encoding the heterologous protein, and (iii) a second DNA sequence encoding a signal peptide. The second DNA sequence is operably linked to the transcriptional regulatory region and to the first DNA sequence. The first DNA sequence is in translation-frame with the second DNA sequence, and the two sequences encode a fusion protein.

5.7.10.2 Inducing the Transcriptional Regulatory Region

In other embodiments of the method, the transcriptional regulatory region may be a promoter derived from a rice or barley α-amylase gene, including RAmy1A, RAmy1B, RAmy2A, RAmy3A, RAmy3B, RAmy3C, RAmy3D, RAmy3E, pM/C, gKAmyl41, gKAmyl55, Amy32b, or HV18. The chimeric gene may further include, between the transcriptional regulatory region and the fusion protein coding sequence, the 5′ untranslated region (5′ UTR) of an inducible monocot gene such as one of the rice or barley α-amylase genes described above. One preferred 5′ UTR is that from the RAmy1A gene, which is effective to enhance the stability of the gene transcript. The chimeric gene may further include, downstream of the coding sequence, the 3′ untranslated region (3′ UTR) from an inducible monocot gene, such as one of the rice or barley α-amylase genes mentioned above. One preferred 3′ UTR is from the RAmy1A gene.

Where the method is employed in protein production in a monocot cell culture, preferred promoters are the RAmy3D and RAmy3E gene promoters, which are upregulated by sugar depletion in cell culture. Where the gene is employed in protein production in germinating seeds, a preferred promoter is the RAmy1A gene promoter, which is upregulated by gibberellic acid during seed germination. Where the gene is upregulated during seed maturation, a preferred promoter is the barley endosperm-specific B1-hordein promoter.

5.7.11 Identifying such Loci and Discriminating Gene Effects of Restriction Fragment Length Polymorphisms

The development of new methods to identify and characterize the role of individual plant genes in quantitative trait expression has significant impact on plant improvement. Following the development of a new class of plant molecular markers based on restriction fragment length polymorphisms, termed “RFLPs” (Helentjaris et al., Plant Mol. Bio. 5: 109-118 (1985)) (“Helentjaris et al. I”), the processes to identify such loci and discriminate gene effects have been invented and are described and claimed herein. An additional RFLP reference is Roberts, Nuc. Acids Res. 10: 117-144 (1982). The utility of isozyme markers or morphological markers in studies is frequently limited by a lack of informativeness in lines of interest or by an insufficient availability or chromosomal distribution of the loci. Over 300 RFLPs covering all ten maize chromosomes have been characterized (Helentjaris et al., Trends in Genetics. 3: 217-221 (1987)). Various plant genetic linkage maps based on RFLP markers have been constructed (Helentjaris et al., Theor. Appl. Genet. 72: 761-769 (1986); Brassica Figdore et al., Theor. Appl. Genetics. 75: 833-840 (1988); Slocum et al., In “Genetic Maps” (S. J. O'Brien, ed.), 5th Edition, Cold Spring Harbor Press, N.Y. (1990); Wright et al., MNL 61: 89-90 (1987); Helentjaris et al., Weber and Helentjaris, Genetics 121: 583-590 (1989)).

5.7.12 Isozyme Variation in Plant Breeding

The use of isozyme variation in plant breeding is, like RFLP technology, one of indirect selection (Tanksley and Orton, Isozymes in Plant Genetics and Breeding 1B (Elsevier, N.Y. 1983); Vallejos and Tanksley, Theor. Appl. Genet. 66: 241-247 (1983); Stuber et al., Crop. Sci. 22: 737-740 (1982)).

Maize is perhaps the best characterized plant system in terms of isozymes, and yet only about two dozen isozyme loci have been located and it is rare for more than a dozen.

Particular Definitions

The terms below have the following meaning, unless indicated otherwise in the specification.

As used in this specification, a “transiently-active promoter” is any promoter that is active either during a particular phase of plant development or under particular environmental conditions, and is essentially inactive at other times.

A “plant-active promoter” is any promoter that is active in cells of a plant of interest. Plant-active promoters can be of viral, bacterial, fungal, animal or plant origin.

A gene that results in an altered plant phenotype is any gene whose expression leads to the plant exhibiting a trait or traits that would distinguish it from a plant of the same species not expressing the gene. Examples of such altered phenotypes include a different growth habit, altered flower or fruit color or quality, premature or late flowering, increased or decreased yield, sterility, mortality, disease susceptibility, altered production of secondary metabolites, or an altered crop quality such as taste or appearance.

A gene and a promoter are to be considered to be operably linked if they are on the same strand of DNA, in the same orientation, and are located relative to one another such that the promoter directs transcription of the gene (i.e., in cis). The presence of intervening DNA sequences between the promoter and the gene does not preclude an operable relationship.

A “blocking sequence” is a DNA sequence of any length that blocks a promoter from effecting expression of a targeted gene.

A “specific excision sequence” is a DNA sequence that is recognized by a site-specific recombinase.

A “recombinase” is an enzyme that recognizes a specific excision sequence or set of specific excision sequences and effects the removal of, or otherwise alters, DNA between specific excision sequences.

A “repressor element” is a gene product that acts to prevent expression of an otherwise expressible gene. A repressor element can comprise protein, RNA or DNA.

A “repressible promoter” is a promoter that is affected by a repressor element, such that transcription of the gene linked to the repressible promoter is prevented.

“Expression” means transcription or transcription followed by translation of a particular DNA molecule.

As used herein, with respect to a DNA sequence or “gene”, the term “isolated” is defined to mean that the sequence is extracted from its context in the viral genome by chemical means and purified and/or modified to the extent that it can be introduced into the present vectors in the appropriate orientation, i.e., sense or antisense.

“Cell culture” refers to cells and cell clusters, typically callus cells, growing on or suspended in a suitable growth medium.

“Germination” refers to the breaking of dormancy in a seed and the resumption of metabolic activity in the seed, including the production of enzymes effective to break down starches in the seed endosperm.

“Inducible” refers to a promoter that is upregulated by the presence or absence of a small molecule. It includes both indirect and direct inducement.

“Inducible during germination” refers to promoters which are substantially silent but not totally silent prior to germination but are turned on substantially (greater than 25%) during germination and development in the seed. Examples of promoters that are inducible during germination are presented below.

“Small molecules”, in the context of promoter induction, are typically small organic or bioorganic molecules less than about 1 kDal. Examples of such small molecules include sugars, sugar-derivatives (including phosphate derivatives), and plant hormones (such as gibberellic or abscisic acid).

“Specifically regulatable” refers to the ability of a small molecule to preferentially affect transcription from one promoter or group of promoters (e.g., the α-amylase gene family), as opposed to non-specific effects, such as enhancement or reduction of global transcription within a cell by a small molecule.

“Seed maturation” or “grain development” refers to the period starting with fertilization in which metabolizable reserves, e.g., sugars, oligosaccharides, starch, phenolics, amino acids, and proteins, are deposited, with and without vacuole targeting, to various tissues in the seed (grain), e.g., endosperm, testa, aleurone layer, and scutellar epithelium, leading to grain enlargement, grain filling, and ending with grain desiccation.

“Inducible during seed maturation” refers to promoters which are turned on substantially (greater than 25%) during seed maturation.

“Heterologous” is defined to mean not identical, e.g., different in nucleotide and/or amino acid sequence, phenotype or an independent isolate.

“Heterologous DNA” or “foreign DNA” refers to DNA which has been introduced into plant cells from another source, or which is from a plant source, including the same plant source, but which is under the control of a promoter or terminator that does not normally regulate expression of the heterologous DNA.

“Heterologous protein” is a protein, including a polypeptide, encoded by a heterologous DNA. A “transcription regulatory region” or “promoter” refers to nucleic acid sequences that influence and/or promote initiation of transcription. Promoters are typically considered to include regulatory regions, such as enhancer or inducer elements.

“Chimeric” is defined to mean the linkage of two or more DNA sequences which are derived from different sources, strains or species, e.g., from bacteria and plants, or that two or more DNA sequences from the same species are linked in a way that does not occur in the native genome. Thus, the DNA sequences useful in the present invention may be naturally occurring, semi-synthetic or entirely synthetic. The DNA sequence may be linear or circular, i.e., may be located on an intact or linearized plasmid, such as the binary plasmids described below.

A “chimeric gene,” in the context of the present invention, typically comprises a promoter sequence operably linked to a DNA sequence that encodes a heterologous gene product, e.g., a selectable marker gene or a fusion protein gene. A chimeric gene may also contain further transcription regulatory elements, such as transcription termination signals, as well as translation regulatory signals, such as termination codons.

“Operably linked” refers to components of a chimeric gene or an expression cassette that function as a unit to express a heterologous protein. For example, a promoter operably linked to a heterologous DNA, which encodes a protein, promotes the production of functional mRNA corresponding to the heterologous DNA.

A “product” encoded by a DNA molecule includes, for example, RNA molecules and polypeptides.

“Removal” in the context of a metabolite includes both physical removal, as by washing, and the depletion of the metabolite through the absorption and metabolizing of the metabolite by the cells.

“Substantially isolated” is used in several contexts and typically refers to the at least partial purification of a protein or polypeptide away from unrelated or contaminating components.

Methods and procedures for the isolation or purification of proteins or polypeptides are known in the art.

“Stably transformed” as used herein refers to a cereal cell or plant that has foreign nucleic acid stably integrated into its genome which is transmitted through multiple generations.

“α1-antitrypsin” or “AAT” refers to the protease inhibitor which has an amino acid sequence substantially identical or homologous to AAT protein.

“Antithrombin III” or “ATIII” refers to the heparin-activated inhibitor of thrombin and factor Xa, and which has an amino acid sequence substantially identical or homologous to ATIII protein.

“Human serum albumin” or “HSA” refers to a protein which has an amino acid sequence substantially identical or homologous to the mature HSA protein.

“Subtilisin” or “subtilisin BPN′” or “BPN′” refers to the protease enzyme produced naturally by B. amyloliquefaciens.

“proBPN” refers to a form of BPN′ having an approximately 78 amino-acid “pro” moiety that functions as a chaperon polypeptide to assist in folding and activation of the BPN′.

“Codon optimization” refers to changes in the coding sequence of a gene to replace native codons with those corresponding to optimal codons in the host plant.

A DNA sequence is “derived from” a gene, such as a rice or barley α-amylase gene, if it corresponds in sequence to a segment or region of that gene. Segments of genes which may be derived from a gene include the promoter region, the 5′ untranslated region, and the 3′ untranslated region of the gene.

5.7.13 General Approach

Generally, the nomenclature and laboratory procedures with respect to standard recombinant DNA technology can be found in Sambrook, et al., MOLECULAR CLONING: A LABORATORY MANUAL, Cold Spring Harbor Laboratory, Cold Spring Harbor, N.Y. 1989 and in S. B. Gelvin and R. A. Schilperoort, PLANT MOLECULAR BIOLOGY, 1988. Other general references are provided throughout this document. The procedures therein are known in the art and are provided for the convenience of the reader.

Most of the recombinant DNA methods employed in practicing the present invention are standard procedures, well known to those skilled in the art, and described in detail in, for example, European Patent Application Publication Number 223,452, published Nov. 29, 1986, which is incorporated herein by reference. Enzymes are obtained from commercial sources and are used according to the vendor's recommendations or other variations known in the art. General references containing such standard techniques include the following: R. Wu, ed. (1979) Methods in Enzymology, Vol. 68; J. H. Miller (1972) Experiments in Molecular Genetics; J. Sambrook et al. (1989) Molecular Cloning: A Laboratory Manual 2nd Ed.; D. M. Glover, ed. (1985) DNA Cloning Vol. II; H. G. Polites and K. R. Marotti (1987) “A step-wise protocol for cDNA synthesis,” Biotechniques 4: 514-520; S. B. Gelvin and R. A. Schilperoort, eds. Introduction, Expression, and Analysis of Gene Products in Plants, all of which are incorporated by reference.

The recombinase/excision sequence system can be any one that selectively removes DNA in a plant genome. The excision sequences are preferably unique in the plant, so that unintended cleavage of the plant genome does not occur. Several examples of such systems are discussed in Sauer, U.S. Pat. No. 4,959,317 and in Sadowski (1993). A preferred system is the bacteriophage CRE/LOX system, wherein the CRE protein performs site-specific recombination of DNA at LOX sites. Other systems include the resolvases (Hall, 1993), FLP (Pan, et al., 1993), SSV1 encoded integrase (Muskhelishvili, et al., 1993), and the maize Ac/Ds transposon system (Shen and Hohn, 1992).

5.7.13.1 Control of Plant Gene Expression

5.7.13.1.1 Using a Transiently-Active Promoter

This invention relates to a method of creating transgenic plants wherein the expression of certain plant traits is ultimately under external control. In one embodiment the control is achieved through application of an external stimulus; in another embodiment it is achieved through hybridization; in still another embodiment it is achieved by direct introduction of a recombinase or recombinase gene into a plant. The transgenic plants of the present invention are prepared by introducing into their genome a series of functionally interrelated DNA sequences, containing the following basic elements: a plant-active promoter that is active at a particular stage in plant development or under particular environmental conditions (“transiently-active promoter”), a gene whose expression results in an altered plant phenotype which is linked to the transiently-active promoter through a blocking sequence separating the transiently-active promoter and the gene, unique specific excision sequences flanking the blocking sequence, wherein the specific excision sequences are recognizable by a site-specific recombinase, a gene encoding the site-specific recombinase, an alternative repressible promoter linked to the recombinase gene, and an alternative gene that encodes the repressor specific for the repressible promoter, the action of the repressor being responsive to an applied or exogenous stimulus. While these elements may be arranged in any order that achieves the interactions described below, in one embodiment they are advantageously arranged as follows: a first DNA sequence contains the transiently-active promoter, a first specific excision sequence, the blocking sequence, a second specific excision sequence, and the gene whose expression results in an altered plant phenotype; a second DNA sequence contains the repressible promoter operably linked to the recombinase gene, and optionally an enhancer; and a third DNA sequence contains the gene encoding the repressor specific for the repressible promoter, itself linked to a promoter functional and constitutive in plants. The third DNA sequence can conveniently act as the blocking sequence located in the first DNA sequence, but can also occur separately without altering the function of the system. This embodiment can be modified such that the recombinase sequence is introduced separately via a viral vector. In an alternative embodiment, an advantageous arrangement is as follows: a first plant containing a DNA sequence comprising the transiently-active promoter, a first specific excision signal sequence, the blocking sequence, a second specific excision signal sequence, and the gene whose expression results in an altered plant phenotype; a second plant containing a DNA sequence comprising a constitutive plant-active promoter operably linked to the recombinase gene (the two plants being hybridized to produce progeny that contain all of the above sequences).

When a plant contains the basic elements of either embodiment, the gene whose expression results in an altered plant phenotype is not active, as it is separated from its promoter by the blocking sequence. In the first embodiment, absent the external stimulus, the repressor is active and represses the promoter that controls expression of the recombinase; in the alternative embodiment the recombinase is not present in the same plant as the first DNA sequence. Such a plant will not display the altered phenotype, and will produce seed that would give rise to plants that also do not display the altered phenotype. When the stimulus to which the repressor is sensitive is applied to this seed or this plant, the repressor no longer functions, permitting the expression of the site-specific recombinase, or alternatively, when the recombinase is introduced via hybridization it is expressed during germination of the seed, either of which effects the removal of the blocking sequence between the specific excision signal sequences. Upon removal of the blocking sequence, the transiently-active promoter becomes directly linked to the gene whose expression results in an altered plant phenotype. A plant grown from either treated or hybrid seed, or a treated plant, will still not exhibit the altered phenotype, until the transiently-active promoter becomes active during the plant's development, after which the gene to which it is linked is expressed, and the plant will exhibit an altered phenotype.
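The control logic of the first embodiment can be summarized as a small state model. The following Python sketch is illustrative only: the class and function names are hypothetical, and the model deliberately ignores expression kinetics, treating each element as simply present or absent.

```python
from dataclasses import dataclass


@dataclass
class PlantState:
    """Hypothetical on/off model of the elements described above."""
    blocking_sequence_present: bool = True   # blocking sequence still separates promoter and gene
    stimulus_applied: bool = False           # exogenous stimulus that inactivates the repressor
    promoter_stage_reached: bool = False     # transiently-active promoter is currently active


def recombinase_expressed(state: PlantState) -> bool:
    # The repressor is made constitutively; only the stimulus relieves repression.
    return state.stimulus_applied


def apply_stimulus(state: PlantState) -> None:
    # Applying the stimulus permits recombinase expression, which excises the
    # blocking sequence. Excision is treated as irreversible.
    state.stimulus_applied = True
    if recombinase_expressed(state):
        state.blocking_sequence_present = False


def altered_phenotype(state: PlantState) -> bool:
    # The trait gene is expressed only when it is directly linked to the
    # transiently-active promoter AND that promoter is active.
    return (not state.blocking_sequence_present) and state.promoter_stage_reached


untreated = PlantState(promoter_stage_reached=True)
treated = PlantState()
apply_stimulus(treated)
print(altered_phenotype(untreated))  # blocked gene: no altered phenotype
print(altered_phenotype(treated))    # unblocked, but promoter not yet active
treated.promoter_stage_reached = True
print(altered_phenotype(treated))    # unblocked and promoter active
```

The model makes the two independent conditions explicit: excision of the blocking sequence is necessary but not sufficient, since the phenotype appears only once the promoter's developmental stage is reached.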

5.7.13.1.2 Transgenic Plants that Produce Seeds that Cannot Germinate

In a preferred embodiment, the present invention involves a transgenic plant or seed which, upon treatment with an external stimulus, produces plants that produce seed that cannot germinate (but that is unaltered in other respects). If the transiently-active promoter is one that is active only in late embryogenesis, the gene to which it is linked will be expressed only in the last stages of seed development or maturation. If the gene linked to this promoter is a lethal gene, it will render the seed produced by the plants incapable of germination. In the initially-transformed plant cells, this lethal gene is not expressed, not only because the promoter is intrinsically inactive, but because there is a blocking sequence separating the lethal gene from its promoter. Also within the genome of these cells are the gene for the recombinase, linked to a repressible promoter, and the gene coding for the repressor. The repressor is expressed constitutively and represses the expression of the recombinase. These plant cells can be regenerated into a whole plant and allowed to produce seed. The mature seed is exposed to a stimulus, such as a chemical agent, that inhibits the function of the repressor. Upon inhibition of the repressor, the promoter driving the recombinase gene is derepressed and the recombinase gene is expressed. The resulting recombinase recognizes the specific excision sequences flanking the blocking sequence, and effects the removal of the blocking sequence. The late embryogenesis promoter and the lethal gene are then directly linked. The lethal gene is not expressed, however, because the promoter is not active at this time in the plant's life cycle. This seed can be planted, and grown to produce a desired crop of plants. As the crop matures and produces a second generation of seed, the late embryogenesis promoter becomes active, and the lethal gene is expressed in the maturing second generation seed, which is rendered incapable of germination. In this way, accidental reseeding, escape of the crop plant to areas outside the area of cultivation, or germination of stored seed can be avoided.

5.7.13.1.3 Transgenic Plants Hybridized to Display Phenotype Not Seen in Either Parent

In an alternative preferred embodiment, the present invention involves a pair of transgenic plants that are hybridized to produce progeny that display a phenotype not seen in either parent. In this alternative embodiment a transiently-active promoter that is active only in late embryogenesis can be linked to a lethal gene, with an intervening blocking sequence bounded by the specific excision sequences. These genetic sequences can be introduced into plant cells to produce one transgenic parent plant. The recombinase gene is linked to a germination-specific promoter and introduced into separate plant cells to produce a second transgenic parent plant. Both of these plants can produce viable seed if pollinated. If the first and second transgenic parent plants are hybridized, the progeny will contain both the blocked lethal gene and the recombinase gene. The recombinase is expressed upon germination of the seed and effects the removal of the blocking sequence, as in the first embodiment, thereby directly linking the lethal gene and the transiently-active promoter. As in the first embodiment, this promoter becomes active during maturation of the second generation seed, resulting in seed that is incapable of germination. Ideally, the first parent employs a male-sterility gene as the blocking sequence, and includes an herbicide resistance gene. In this way, self-pollination of the first transgenic parent plant is avoided, and self-pollinated second transgenic parent plants can be eliminated by application of the herbicide. In the hybrid progeny, the male-sterility gene is removed by the recombinase, resulting in hybrid progeny capable of self-pollination.

5.7.13.1.4. Linking the Recombinase Gene to an Inducible Promoter

In another embodiment, the recombinase gene is linked to an inducible promoter. Examples of such promoters include the copper-controllable gene expression system (Mett et al., 1993) and the steroid-inducible gene system (Schena et al., 1991). Exposure of the transgenic plant to the inducer specific for the inducible promoter leads to expression of the recombinase gene and the excision of the blocking sequence. The gene that results in an altered plant phenotype is then expressed when the transiently-active promoter becomes active.

The gene or genes linked to the plant development promoter can be any gene or genes whose expression results in a desired detectable phenotype. In one method, the gene(s) can be linked to the plant development promoter to control expression developmentally.

In those embodiments employing a repressible promoter system, the gene encoding the repressor is responsive to an outside stimulus, or encodes a repressor element that is itself responsive to an outside stimulus, so that repressor function can be controlled by the outside stimulus. The stimulus is preferably one to which the plant is not normally exposed, such as a particular chemical, temperature shock, or osmotic shock. A preferred system is the Tn10 tet repressor system, which is responsive to tetracycline (Gatz and Quail (1988); Gatz, et al. (1992)). Examples of other repressible promoter systems are described by Lanzer and Bujard (1988) and Ptashne, et al.

5.7.14. Conferring Viral Resistance to the Plant

To practice the present invention, a viral gene must be isolated from the viral genome and inserted into a vector containing the genetic regulatory sequences necessary to express the inserted gene. Accordingly, a vector must be constructed to provide the regulatory sequences such that they will be functional upon inserting a desired gene. When the expression vector/insert construct is assembled, it is used to transform plant cells which are then used to regenerate plants. These transgenic plants carry the viral gene in the expression vector/insert construct. The gene is expressed in the plant and increased resistance to viral infection is conferred thereby.

5.7.14.1. Table of Selected Literature References to Methods of Isolating, Cloning and Expressing Viral Genes

The nucleotide sequences encoding the coat protein genes and nuclear inclusion genes of a number of viruses have been determined and the genes have been inserted into expression vectors. The expression vectors contain the necessary genetic regulatory sequences for expression of an inserted gene. The coat protein gene is inserted such that those regulatory sequences are functional and the genes can be expressed when incorporated into a plant genome. Selected literature references to methods of isolating, cloning and expressing viral genes are listed in Table 3, below.

TABLE 3
Cloned Genes From RNA Viruses
Viral Gene: Reference
Papaya ringspot cp: M. M. Fitch et al., Bio/Technology, 10, 1466 (1992)
Potato virus X cp: K. Ling et al., Bio/Technology, 2, 752 (1991); A. Hoekema et al., Bio/Technology, 7, 273 (1989)
Watermelon Mosaic Virus II cp: H. Quemada et al., J. Gen. Virol., 71, 1451 (1990); S. Namba et al., Phytopathology, 82, 940 (1992)
Zucchini yellow Mosaic Virus cp: S. Namba et al., Phytopathology, 82, 940 (1992)
Tobacco Mosaic Virus cp: R. S. Nelson et al., Bio/Technology, 6, 403 (1988); P. Powell Abel et al., Science, 232, 738 (1986)
Alfalfa Mosaic Virus cp: Loesch-Fries et al., EMBO J., 6, 1845 (1987); N. E. Turner et al., EMBO J., 6, 1181 (1987)
Soybean Mosaic Virus cp: D. M. Stark et al., Biotechnology, 7, 1257 (1989)
Cucumber Mosaic Virus strain C cp: H. Q. Quemada et al., Molec. Plant Pathol., 81, 794 (1991)
Cucumber Mosaic Virus strain WL cp: UpJohn Co. (PCT WO90/02185)
Tobacco etch virus cp: Allison et al., Virology, 147, 309 (1985)
Tobacco etch virus nuclear inclusion protein: J. C. Carrington et al., J. Virol., 61, 2540 (1987)
Pepper Mottle Virus cp: W. G. Dougherty et al., Virology, 146, 282 (1985)
Potato virus Y cp: D. D. Shukla et al., Virology, 152, 118 (1986)
Potato virus Y nuclear inclusion protein: European Patent Application 578,627
Potato virus X cp: C. Lawson et al., Biotechnology, 8, 127 (1990)
Tobacco streak virus (TSV) cp: C. M. Van Dun et al., Virology, 164, 383 (1988)

5.7.15 Disease Resistance

One aspect of the present invention relates to the use of trait DNA molecules which are heterologous to the plant, e.g., DNA molecules that confer disease resistance to plants transformed with the DNA construct.

5.7.15.1 Sense/Antisense Orientation

The DNA molecule conferring disease resistance can be positioned within the DNA construct in sense orientation. Alternatively, it can have an antisense orientation. Antisense RNA technology involves the production of an RNA molecule that is complementary to the messenger RNA molecule of a target gene; the antisense RNA can potentially block all expression of the targeted gene. In the anti-virus context, plants are made to express an antisense RNA molecule corresponding to a viral RNA (that is, the antisense RNA is an RNA molecule which is complementary to a plus sense RNA species encoded by an infecting virus). Such plants may show a slightly decreased susceptibility to infection by that virus. Such a complementary RNA molecule is termed antisense RNA.

5.7.16 Other Traits

The present invention is also used to confer traits other than disease resistance on plants. For example, DNA molecules which impart a plant genetic trait can be used as the DNA trait molecule of the present invention. In this aspect of the present invention, suitable trait DNA molecules encode for desired color, enzyme production, or combinations thereof.

5.7.16.1 Gene Silencing

The silencer DNA molecule of the present invention can be selected from virtually any nucleic acid which effects gene silencing. This involves the cellular mechanism to degrade mRNA homologous to the transgene mRNA. The silencer DNA molecule can be heterologous to the plant, need not interact with the trait DNA molecule in the plant, and can be positioned 3′ to the trait DNA molecule. For example, the silencer DNA molecule can be a viral cDNA molecule, a jellyfish green fluorescent protein encoding DNA molecule, a plant DNA molecule, or combinations thereof.

While not wishing to be bound by theory, by use of the construct of the present invention, it is believed that post-transcriptional gene silencing is achieved. More particularly, the silencer DNA molecule is believed to boost the level of heterologous RNA within the cell above a threshold level. This activates the degradation mechanism by which viral resistance is achieved.

5.7.16.1.1 Trait & Silencer DNA Molecules Encoding RNA Molecules—Translatable

It is possible for the DNA construct of the present invention to be configured so that the trait and silencer DNA molecules encode RNA molecules which are translatable. As a result, that RNA molecule will be translated at the ribosomes to produce the protein encoded by the DNA construct. Production of proteins in this manner can be increased by joining the cloned gene encoding the DNA construct of interest with synthetic double-stranded oligonucleotides which represent a viral regulatory sequence (i.e., a 5′ untranslated sequence). See U.S. Pat. No. 4,820,639 to Gehrke and U.S. Pat. No. 5,849,527 to Wilson, which are hereby incorporated by reference.

5.7.16.1.2 Trait & Silencer DNA Molecules Encoding RNA Molecules—Not Translatable

Alternatively, the DNA construct of the present invention can be configured so that the trait and silencer DNA molecules encode mRNA which is not translatable. This is achieved by introducing into the DNA molecule one or more premature stop codons, adding one or more bases (except multiples of 3 bases) to displace the reading frame, removing the translation initiation codon, etc. See U.S. Pat. No. 5,583,021 to Dougherty et al., which is hereby incorporated by reference.
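The three modifications named above can be illustrated on a toy coding sequence. The Python sketch below is not taken from the cited patent; the function names and the example sequence are hypothetical, and the sequence is written as coding-strand DNA read in frame from its first base.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def codons(seq):
    """Split a coding-strand DNA sequence into complete in-frame triplets."""
    usable = len(seq) - len(seq) % 3
    return [seq[i:i + 3] for i in range(0, usable, 3)]


def first_stop_index(seq):
    """Return the codon index of the first in-frame stop codon, or None."""
    for index, codon in enumerate(codons(seq)):
        if codon in STOP_CODONS:
            return index
    return None


def insert_premature_stop(seq, codon_index, stop="TAA"):
    """Replace the codon at codon_index with a stop codon."""
    start = codon_index * 3
    return seq[:start] + stop + seq[start + 3:]


def frameshift(seq, position, bases="A"):
    """Insert bases at position; multiples of 3 would not shift the frame."""
    if len(bases) % 3 == 0:
        raise ValueError("inserting a multiple of 3 bases keeps the reading frame")
    return seq[:position] + bases + seq[position:]


def remove_start_codon(seq):
    """Delete a leading ATG initiation codon if one is present."""
    return seq[3:] if seq.startswith("ATG") else seq


example = "ATGGCTGGTAAATTTTAA"  # hypothetical: Met-Ala-Gly-Lys-Phe-stop
print(first_stop_index(example))                            # 5
print(first_stop_index(insert_premature_stop(example, 2)))  # 2
print(codons(frameshift(example, 3)))                       # downstream codons change
print(remove_start_codon(example).startswith("ATG"))        # False
```

Each helper returns a new string, so the modifications can be applied singly or in combination to the same starting sequence.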

5.7.17 Recombinant DNA Technology

The subject DNA construct can be incorporated in cells using conventional recombinant DNA technology. Generally, this involves inserting the DNA construct into an expression system to which the DNA construct is heterologous (i.e. not normally present). The heterologous DNA construct may be inserted into the expression system or vector in proper sense orientation and correct reading frame. The vector contains the necessary elements for the transcription of the inserted sequences.

5.7.18 Incorporation into a Host Cell

Once the DNA construct has been cloned into an expression system, it is ready to be incorporated into a host cell. Such incorporation can be carried out by the various forms of transformation noted above, depending upon the vector/host cell system. Suitable host cells include, but are not limited to, bacteria, virus, plant cells, and the like.

5.7.18.1 Plant Transformation—Production of Mature Proteins in Plants

Expression vectors for use in the present invention comprise a chimeric gene (or expression cassette), designed for operation in plants, with companion sequences upstream and downstream from the expression cassette. The companion sequences will be of plasmid or viral origin and provide necessary characteristics to the vector to permit the vectors to move DNA from bacteria to the desired plant host. For transformation of plants, the chimeric gene is placed in a suitable expression vector designed for operation in plants. The vector includes suitable elements of plasmid or viral origin that provide necessary characteristics to the vector to permit the vectors to move DNA from bacteria to the desired plant host. Suitable components of the expression vector, including an inducible promoter, coding sequence for a signal peptide, coding sequence for a mature heterologous protein, and suitable termination sequences, are discussed below. One exemplary vector is the p3Dv1.0 (p3D(AAT)v1.0) vector described herein.

5.7.18.1.2 Transformation Vector

Vectors containing a chimeric gene of the present invention may also include selectable markers for use in plant cells, such as the nptII kanamycin resistance gene, for selection in kanamycin-containing medium, or the phosphinothricin acetyltransferase gene, for selection in medium containing phosphinothricin (PPT).

The vectors may also include sequences that allow their selection and propagation in a secondary host, such as sequences containing an origin of replication and a selectable marker such as antibiotic or herbicide resistance genes, e.g., HPH (Hagio et al., Plant Cell Reports 14: 329 (1995); van den Elzen, Plant Mol. Biol. 5: 299-302 (1985)). Typical secondary hosts include bacteria and yeast. In one embodiment, the secondary host is Escherichia coli, the origin of replication is a ColE1-type, and the selectable marker is a gene encoding ampicillin resistance. Such sequences are well known in the art and are commercially available as well (e.g., Clontech, Palo Alto, Calif.; Stratagene, La Jolla, Calif.).

The vectors of the present invention may also be modified to intermediate plant transformation plasmids that contain a region of homology to an Agrobacterium tumefaciens vector, a T-DNA border region from Agrobacterium tumefaciens, and chimeric genes or expression cassettes (described above). Further, the vectors of the invention may comprise a disarmed plant tumor inducing plasmid of Agrobacterium tumefaciens.

5.7.19 Plant Expression Vector Production of Mature Proteins in Plants

5.7.19.1 Suitable Vectors

Suitable vectors include, but are not limited to, viral vectors such as lambda vector system gt11, gt WES.tB, Charon 4, and plasmid vectors such as pBR322, pBR325, pACYC177, pACYC184, pUC8, pUC9, pUC18, pUC19, pLG339, pR290, pKC37, pKC101, SV 40, pBluescript II SK+/− or KS+/− (see “Stratagene Cloning Systems” Catalog (1993) from Stratagene, La Jolla, Calif., which is hereby incorporated by reference), pQE, pIH821, pGEX, pET series (see F. W. Studier et al., “Use of T7 RNA Polymerase to Direct Expression of Cloned Genes,” Gene Expression Technology vol. 185 (1990), which is hereby incorporated by reference), and any derivatives thereof. Recombinant molecules can be introduced into cells via transformation, particularly transduction, conjugation, mobilization, or electroporation. The DNA sequences are cloned into the vector using standard cloning procedures in the art, as described by Sambrook et al., Molecular Cloning: A Laboratory Manual, Cold Spring Harbor Laboratory, Cold Spring Harbor, N.Y. (1989), which is hereby incorporated by reference.

5.7.19.2 Host-Vector Systems

A variety of host-vector systems may be utilized to carry out the present invention. Primarily, the vector system must be compatible with the host cell used. Host-vector systems include but are not limited to the following: bacteria transformed with bacteriophage DNA, plasmid DNA, or cosmid DNA; microorganisms such as yeast containing yeast vectors; mammalian cell systems infected with virus (e.g., vaccinia virus, adenovirus, etc.); insect cell systems infected with virus (e.g., baculovirus); and plant cells infected by bacteria. The expression elements of these vectors vary in their strength and specificities. Depending upon the host-vector system utilized, any one of a number of suitable transcription and, perhaps, translation elements can be used.

5.7.19.3 Signal Sequences for Production of Mature Proteins

In addition to encoding the protein of interest, the chimeric gene encodes a signal sequence (or signal peptide) that allows processing and translocation of the protein, as appropriate. Suitable signal sequences are described in above-referenced PCT application WO 95/14099. The plant signal sequence is placed in frame with a heterologous nucleic acid encoding a mature protein, forming a construct which encodes a fusion protein having an N-terminal region corresponding to the signal peptide and, immediately adjacent to the C-terminal amino acid of the signal peptide, the N-terminal amino acid of the mature heterologous protein. The expressed fusion protein is subsequently secreted and processed by signal peptidase cleavage precisely at the junction of the signal peptide and the mature protein, to yield the mature heterologous protein.

In another embodiment of the invention, the coding sequence in the fusion protein gene, in at least the coding region for the signal sequence, may be codon-optimized for optimal expression in plant cells, e.g., rice cells, as described below.

5.7.20 Promotors

Transcription Dependent upon the Presence of a Promotor

Transcription of DNA is dependent upon the presence of a promoter, which is a DNA sequence that directs the binding of RNA polymerase and thereby promotes mRNA synthesis. The DNA sequences of eucaryotic promoters differ from those of procaryotic promoters. Furthermore, eucaryotic promoters and accompanying genetic signals may not be recognized in or may not function in a procaryotic system, and, further, procaryotic promoters are not recognized and do not function in eucaryotic cells.

The segment of DNA referred to as the promoter is responsible for the regulation of the transcription of DNA into mRNA. A number of promoters which function in plant cells are known in the art and may be employed in the practice of the present invention. These promoters may be obtained from a variety of sources such as plants or plant viruses, and may include but are not limited to promoters isolated from the caulimovirus group such as the cauliflower mosaic virus 35S promoter (CaMV35S), the enhanced cauliflower mosaic virus 35S promoter (enhCaMV35S), the figwort mosaic virus full-length transcript promoter (FMV35S), and the promoter isolated from the chlorophyll a/b binding protein. Other useful promoters include promoters which are capable of expressing the potyvirus proteins in an inducible manner or in a tissue-specific manner in certain cell types in which the infection is known to occur. For example, the inducible promoters from phenylalanine ammonia lyase, chalcone synthase, hydroxyproline rich glycoprotein, extensin, pathogenesis-related proteins (e.g. PR-1a), and wound-inducible protease inhibitor from potato may be useful.

Preferred promoters for use in the present viral gene expression cassettes include the constitutive promoters from CaMV, the Ti genes nopaline synthase (Bevan et al., Nucleic Acids Res. 11, 369-385 (1983)) and octopine synthase (Depicker et al., J. Mol. Appl. Genet., 1, 561-564 (1982)), and the bean storage protein gene phaseolin. The poly(A) addition signals from these genes are also suitable for use in the present cassettes. The particular promoter selected is preferably capable of causing sufficient expression of the DNA coding sequences to which it is operably linked, to result in the production of amounts of the proteins or the RNAs effective to provide viral resistance, but not so much as to be detrimental to the cell in which they are expressed. The promoters selected should be capable of functioning in tissues including but not limited to epidermal, vascular, and mesophyll tissues. The actual choice of the promoter is not critical, as long as it has sufficient transcriptional activity to accomplish the expression of the preselected proteins or antisense RNA, and subsequent conferral of viral resistance to the plants.

Promoters vary in their “strength” (i.e. their ability to promote transcription). For the purposes of expressing a cloned gene, it is desirable to use strong promoters in order to obtain a high level of transcription and, hence, expression of the gene. Depending upon the host cell system utilized, any one of a number of suitable promoters may be used. For instance, when cloning in E. coli, its bacteriophages, or plasmids, promoters such as the T7 phage promoter, lac promoter, trp promoter, recA promoter, ribosomal RNA promoter, the P_(R) and P_(L) promoters of coliphage lambda and others, including but not limited to lacUV5, ompF, bla, lpp, and the like, may be used to direct high levels of transcription of adjacent DNA segments. Additionally, a hybrid trp-lacUV5 (tac) promoter or other E. coli promoters produced by recombinant DNA or other synthetic DNA techniques may be used to provide for transcription of the inserted gene.

5.7.20.1 Promoters that Transcribe the Cereal α-Amylase Genes and Sucrose Synthase Genes

The transcription regulatory or promoter region is chosen to be regulated in a manner allowing for induction under selected cultivation conditions, e.g., sugar depletion in culture or water uptake followed by gibberellic acid production in germinating seeds. Suitable promoters, and their method of selection, are detailed in above-cited PCT application WO 95/14099. Examples of such promoters include those that transcribe the cereal α-amylase genes and sucrose synthase genes, and are repressed or induced by small molecules, like sugars, sugar depletion or phytohormones such as gibberellic acid or abscisic acid. Representative promoters include the promoters from the rice α-amylase RAmy1A, RAmy1B, RAmy2A, RAmy3A, RAmy3B, RAmy3C, RAmy3D, and RAmy3E genes, and from the pM/C, gKAmy141, gKAmy155, Amy32b, and HV18 barley α-amylase genes. These promoters are described, for example, in ADVANCES IN PLANT BIOTECHNOLOGY, Ryu, D. D. Y., et al., Eds., Elsevier, Amsterdam, 1994, p. 37, and references cited therein. Other suitable promoters include the sucrose synthase and sucrose-6-phosphate-synthetase (SPS) promoters from rice and barley.

Other suitable promoters include promoters which are regulated in a manner allowing for induction under seed-maturation conditions. Examples of such promoters include those associated with the following monocot storage proteins: rice glutelins, oryzins, and prolamines, barley hordeins, wheat gliadins and glutelins, maize zeins and glutelins, oat glutelins, sorghum kafirins, millet pennisetins, and rye secalins.

A preferred promoter for expression in germinating seeds is the rice α-amylase RAmy1A promoter, which is upregulated by gibberellic acid. Preferred promoters for expression in cell culture are the rice α-amylase RAmy3D and RAmy3E promoters, which are strongly upregulated by sugar depletion in the culture. These promoters are also active during seed germination. A preferred promoter for expression in maturing seeds is the barley endosperm-specific B1-hordein promoter (Brandt, A., et al., (1985) Carlsberg Res. Commun. 50: 333-345).

The chimeric gene may further include, between the promoter and coding sequences, the 5′ untranslated region (5′ UTR) of an inducible monocot gene, such as the 5′ UTR derived from one of the rice or barley α-amylase genes mentioned above. One preferred 5′ UTR is that derived from the RAmy1A gene, which is effective to enhance the stability of the gene transcript.

5.7.20.2 Use of Inducers

Bacterial host cell strains and expression vectors may be chosen which inhibit the action of the promoter unless specifically induced. In certain operations, the addition of specific inducers is necessary for efficient transcription of the inserted DNA. For example, the lac operon is induced by the addition of lactose or IPTG (isopropylthio-beta-D-galactoside). A variety of other operons, such as trp, pro, etc., are under different controls.

5.7.20.3 Transcription Initiation Signals

Specific initiation signals are also required for efficient gene transcription in procaryotic cells.

These transcription initiation signals may vary in “strength” as measured by the quantity of gene specific messenger RNA and protein synthesized, respectively. The DNA expression vector, which contains a promoter, may also contain any combination of various “strong” transcription initiation signals.

5.7.20.4 Translation Dependent on Shine-Dalgarno Sequence (in Procaryotes)

Similarly, translation of mRNA in procaryotes depends upon the presence of the proper procaryotic signals, which differ from those of eucaryotes. Efficient translation of mRNA in procaryotes requires a ribosome binding site called the Shine-Dalgarno (“SD”) sequence on the mRNA. This sequence is a short nucleotide sequence of mRNA that is located before the start codon, usually AUG, which encodes the amino-terminal methionine of the protein. The SD sequences are complementary to the 3′-end of the 16S rRNA (ribosomal RNA) and probably promote binding of mRNA to ribosomes by duplexing with the rRNA to allow correct positioning of the ribosome. For a review on maximizing gene expression, see Roberts and Lauer, Methods in Enzymology, 68: 473 (1979), which is hereby incorporated by reference.
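As an illustration of locating such a site, the following Python sketch scans the region upstream of a start codon for a match to an SD-like core motif. The core AGGAGG is the commonly cited E. coli consensus and is an assumption here, as are the function name, the 20-nucleotide window, the match threshold, and the example mRNA; none of these come from the present specification.

```python
def find_shine_dalgarno(mrna, start, core="AGGAGG", window=20, min_matches=5):
    """Search upstream of the start codon for the best match to an SD-like core.

    Returns (matches, motif_position, spacing_to_start_codon), or None when no
    candidate reaches min_matches. All parameter defaults are illustrative.
    """
    upstream_begin = max(0, start - window)
    best = None
    for i in range(upstream_begin, start - len(core) + 1):
        candidate = mrna[i:i + len(core)]
        # Count positions where the candidate agrees with the core motif.
        matches = sum(a == b for a, b in zip(candidate, core))
        if matches >= min_matches and (best is None or matches > best[0]):
            spacing = start - (i + len(core))  # nucleotides between motif and AUG
            best = (matches, i, spacing)
    return best


mrna = "ACUAGGAGGUUAUCCAUGGCUAAA"  # hypothetical transcript
start = mrna.find("AUG")
print(start)                             # 15
print(find_shine_dalgarno(mrna, start))  # (6, 3, 6)
```

A simple identity count is used rather than a base-pairing energy model, so the result indicates only sequence similarity to the assumed consensus, not predicted translation efficiency.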

The non-translated leader sequence can be derived from any suitable source and can be specifically modified to increase the translation of the mRNA. The 5′ non-translated region can be obtained from the promoter selected to express the gene, an unrelated promoter, the native leader sequence of the gene or coding region to be expressed, viral RNAs, suitable eucaryotic genes, or a synthetic gene sequence. The present invention is not limited to the constructs presented in the following examples.

5.7.20.5 A4. Codon-Optimized Coding Sequences

In accordance with one aspect of the invention, it has been discovered that a severalfold enhancement of expression level can be achieved in plant cell culture by modifying the native coding sequence of a heterologous gene to contain predominantly or exclusively the highest-frequency codons found in the plant cell host.

The method will be illustrated for expression of a heterologous gene in rice plant cells, it being recognized that the method is generally applicable to any monocot. As a first step, a representative set of known coding gene sequences from rice is assembled. The sequences are then analyzed for codon frequency for each amino acid, and the most frequent codon is selected for each amino acid. This approach differs from earlier reported codon matching methods, in which more than one frequent codon is selected for at least some of the amino acids. The optimal codons selected in this manner for rice and barley are shown in Table 4.

TABLE 4

Amino Acid      Rice Preferred Codon    Barley Preferred Codon
Ala   A         GCC
Arg   R         CGC
Asn   N         AAC
Asp   D         GAC
Cys   C         UGC
Gln   Q         CAG
Glu   E         GAG
Gly   G         GGC
His   H         CAC
Ile   I         AUC
Leu   L         CUC
Lys   K         AAG
Phe   F         UUC
Pro   P         CCG                     CCC
Ser   S         AGC                     UCC
Thr   T         ACC
Tyr   Y         UAC
Val   V         GUC                     GUG
stop            UAA                     UGA
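The codon-selection step described above (tally codon usage over a representative set of host coding sequences, then keep the single most frequent codon for each amino acid) can be sketched as follows. This is a minimal illustration, not the procedure actually used to derive Table 4: the function names are hypothetical, the standard genetic code is assumed, the input sequences are toy data, and ties between equally frequent codons are broken arbitrarily.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, built from the conventional TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def preferred_codons(coding_seqs):
    """Map each amino acid (one-letter code, '*' for stop) to the codon
    used most often across the supplied in-frame coding sequences."""
    counts = defaultdict(Counter)
    for seq in coding_seqs:
        seq = seq.upper().replace("U", "T")
        usable = len(seq) - len(seq) % 3  # ignore a trailing partial codon
        for i in range(0, usable, 3):
            codon = seq[i:i + 3]
            aa = CODON_TABLE.get(codon)  # skips codons with ambiguous bases
            if aa is not None:
                counts[aa][codon] += 1
    return {aa: c.most_common(1)[0][0] for aa, c in counts.items()}


def recode(protein, preferred):
    """Back-translate a protein using only the preferred codon per residue."""
    return "".join(preferred[aa] for aa in protein)


if __name__ == "__main__":
    toy_host_genes = ["GCCGCCGCTAAGAAGAAA"]  # toy data, not real rice genes
    table = preferred_codons(toy_host_genes)
    print(table)
    print(recode("AK", table))
```

Selecting exactly one codon per amino acid mirrors the distinction the text draws from earlier codon matching methods, which retain more than one frequent codon for some amino acids.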

As indicated above, the fusion protein coding sequence in the chimeric gene is constructed such that the final (C-terminal) codon in the signal sequence is immediately followed by the codon for the N-terminal amino acid in the mature form of the heterologous protein.

TABLE 5

N-Glycosylation Sites    Location (Asn) (in mature protein)    Amino Acid Substitution
Asn Asn Ser              61                                    Thr Asn Ser
Asn Asn Ser              76                                    Thr Asn Ser
Asn Met Ser              123                                   Thr Met Ser
Asn Gly Thr              218                                   Ser Gly Thr′
Asn Trp Thr              240                                   Thr Trp Thr′

′ improved thermostability; Bryan, et al., Proteins: Structure, Function, and Genetics 1: 326 (1986).

5.7.20.6 Termination Sequence Coupled To The Fusion End

The present invention can also utilize a termination sequence operatively coupled to the fusion gene to end transcription. Suitable transcription termination sequences include the termination region of a 3′ non-translated region. This will cause the termination of transcription and the addition of polyadenylated ribonucleotides to the 3′ end of the transcribed mRNA sequence. The choice of termination region or 3′ non-translated region is additionally one of convenience. The termination region may be native with the promoter region or may be derived from another source, and preferably includes a terminator and a sequence coding for polyadenylation. Suitable 3′ non-translated regions include but are not limited to: (1) the 3′ transcribed, non-translated regions containing the polyadenylation signal of Agrobacterium tumor-inducing (Ti) plasmid genes, such as the nopaline synthase (NOS) gene or the 35S promoter terminator gene, and (2) plant genes like the soybean 7S storage protein genes and the pea small subunit of the ribulose 1,5-bisphosphate carboxylase-oxygenase (ssRUBISCO) E9 gene.

The termination region or 3′ non-translated region which is employed is one which will cause the termination of transcription and the addition of polyadenylated ribonucleotides to the 3′ end of the transcribed mRNA sequence. The termination region may be native with the promoter region, native with the structural gene, or may be derived from another source, and preferably includes a terminator and a sequence coding for polyadenylation. Suitable 3′ non-translated regions of the chimeric plant gene include but are not limited to: (1) the 3′ transcribed, non-translated regions containing the polyadenylation signal of Agrobacterium tumor-inducing (Ti) plasmid genes, such as the nopaline synthase (NOS) gene, and (2) plant genes like the soybean 7S storage protein genes.

5.7.20.7 Transcription and Translation Terminators for Production of Mature Proteins

The chimeric gene may also include, downstream of the coding sequence, the 3′ untranslated region (3′ UTR) from an inducible monocot gene, such as one of the rice or barley α-amylase genes mentioned above. One preferred 3′ UTR is that derived from the RAmy1A gene. This sequence includes non-coding sequence 5′ to the polyadenylation site, the polyadenylation site, and the transcription termination sequence. The transcriptional termination region may be selected, particularly for stability of the mRNA to enhance expression. Polyadenylation tails (Alber and Kawasaki, 1982, Mol. and Appl. Genet. 1: 419-434) are also commonly added to the expression cassette to optimize high levels of transcription and proper transcription termination, respectively. Polyadenylation sequences include but are not limited to the Agrobacterium octopine synthetase signal (Gielen, et al., EMBO J. 3: 835-846 (1984)) or the nopaline synthase of the same species (Depicker, et al., Mol. Appl. Genet. 1: 561-573 (1982)).

Since the ultimate expression of the heterologous protein will be in a eukaryotic cell (in this case, a member of the grass family), it is desirable to determine whether any portion of the cloned gene contains sequences which will be processed out as introns by the host's splicing machinery. If so, site-directed mutagenesis of the “intron” region may be conducted to prevent losing a portion of the genetic message as a false intron code (Reed and Maniatis, Cell 41: 95-105 (1985)).

5.7.20.9 Selectable Marker Gene

Selectable marker genes may be incorporated into the present expression cassettes and used to select for those cells or plants which have become transformed. The marker gene employed may express resistance to an antibiotic, such as kanamycin, gentamycin, G418, hygromycin, streptomycin, spectinomycin, tetracycline, chloramphenicol, and the like.

Other markers could be employed in addition to or in the alternative, such as, for example, a gene coding for herbicide tolerance such as tolerance to glyphosate, sulfonylurea, phosphinothricin, or bromoxynil. Additional means of selection could include resistance to methotrexate, heavy metals, complementation providing prototrophy to an auxotrophic host, and the like.

For example, see Table 1 of PCT WO/91/10725, cited above. The present invention also envisions replacing all of the virus-associated genes with an array of selectable marker genes.

The particular marker employed will be one which will allow for the selection of transformed cells as opposed to those cells which were not transformed. Depending on the number of different host species, one or more markers may be employed, where different conditions of selection would be useful to select the different host, and would be known to those of skill in the art. A screenable marker or “reporter gene” such as the β-glucuronidase gene or luciferase gene may be used in place of, or with, a selectable marker. Cells transformed with the β-glucuronidase gene may be identified by the production of a blue product on treatment with 5-bromo-4-chloro-3-indolyl-β-D-glucuronide (X-Gluc).

In developing the present expression construct, the various components of the expression construct such as the DNA sequences, linkers, or fragments thereof will normally be inserted into a convenient cloning vector, such as a plasmid or phage, which is capable of replication in a bacterial host, such as E. coli. Numerous cloning vectors exist that have been described in the literature. After each cloning, the cloning vector may be isolated and subjected to further manipulation, such as restriction, insertion of new fragments, ligation, deletion, resection, insertion, in vitro mutagenesis, addition of polylinker fragments, and the like, in order to provide a vector which will meet a particular need.

5.7.20.10 Transferring Recombinant DNA into Plant Cell

5.7.20.10.1 Use of Micropipettes or Polyethylene Glycol

In producing transgenic plants, the DNA construct in a vector described above can be microinjected directly into plant cells by use of micropipettes to transfer mechanically the recombinant DNA. See Crossway, Mol. Gen. Genetics, 202: 179-85 (1985), which is hereby incorporated by reference. The genetic material may also be transferred into the plant cell using polyethylene glycol. See Krens, et al., Nature, 296: 72-74 (1982), which is hereby incorporated by reference.

5.7.4.12.2 Particle Bombardment (Biolistic Transformation)

Another approach to transforming plant cells with the DNA construct is particle bombardment (also known as biolistic transformation) of the host cell. This can be accomplished in one of several ways. The first involves propelling inert or biologically active particles at cells. This technique is disclosed in U.S. Pat. Nos. 4,945,050, 5,036,006, and 5,100,792, all to Sanford et al., which are hereby incorporated by reference.

5.7.4.12.3 Fusion of Protoplasts with Other Entities

Yet another method of introduction is fusion of protoplasts with other entities, either minicells, cells, lysosomes or other fusible lipid-surfaced bodies. See Fraley, et al., Proc. Natl. Acad. Sci. USA, 79: 1859-63 (1982), which is hereby incorporated by reference.

5.7.4.12.4 Electroporation

The DNA molecule may also be introduced into the plant cells by electroporation. See Fromm et al., Proc. Natl. Acad. Sci. USA, 82: 5824 (1985), which is hereby incorporated by reference. In this technique, plant protoplasts are electroporated in the presence of plasmids containing the expression cassette. Electrical impulses of high field strength reversibly permeabilize biomembranes, allowing the introduction of the plasmids. Electroporated plant protoplasts reform the cell wall, divide, and regenerate.

5.7.4.12.5 Infection with Agrobacterium tumefaciens or A. rhizogenes

Another method of introducing the DNA molecule into plant cells is to infect a plant cell with Agrobacterium tumefaciens or A. rhizogenes previously transformed with the gene. Under appropriate conditions known in the art, the transformed plant cells are grown to form shoots or roots, and develop further into plants. Generally, this procedure involves inoculating the plant tissue with a suspension of bacteria and incubating the tissue for 48 to 72 hours on regeneration medium without antibiotics at 25-28° C.

Methods are explained in various references such as: J. Schell, Science, 237: 1176-83 (1987); U.S. Pat. No. 5,258,300; Herrera-Estrella, Nature, 303, 209 (1983); Biotechnica (published PCT application PCT WO/91/10725); and U.S. Pat. No. 4,940,838.

5.7.21 Method for Making Genetically Recombinant Plants in Commercially Feasible Numbers

In one preferred embodiment, the invention provides for a process for propagating plants by tissue culture in such a way as both to conserve desired plant morphology and to transform the plant with respect to one or more desired genes. The method includes the steps of (a) creating an Agrobacterium vector containing the gene sequence desired to be transferred to the propagated plant, preferably together with a marker gene; (b) taking one or more petiole explants from a mother plant and inoculating them with the Agrobacterium vector; (c) conducting callus formation in the petiole sections in culture, in the dark; and (d) culturing the resulting callus in growth medium having a benzylamino growth regulator such as benzylaminopurine or, most preferably, benzylaminopurineriboside. Additional optional growth regulators including auxins and cytokinins (indole butyric acid, benzylamine, benzyladenine, benzylaminopurine, alpha naphthylacetic acid and others known in the art) may also be present. Preferably, the petiole tissue is taken from Pelargonium x domesticum and the Agrobacterium vector contains an antisense gene for ACC synthase or ACC oxidase to prevent ACC synthase or ACC oxidase expression and, in turn, prevent ethylene formation. Pelargoniums propagated in culture using the present technique are resistant to wilting and petal shatter, and are morphologically conserved due to the use of petiole explants specifically and the particular culture media disclosed. Using a probe for the transposon, the mutated gene can be isolated. Then, using the DNA adjacent to the transposon in the isolated, mutated gene as a probe, the normal wild-type allele of the target gene can be isolated. Such techniques are taught, for example, in McLaughlin and Walbot, Genetics, Vol. 117 pp. 771-776 (1987), as well as numerous other references.

5.7.21.1 Reporter Gene

In addition to the functional gene and the selectable marker gene, the DNA sequences may also contain a reporter gene which facilitates screening of the transformed shoots and plant material for the presence and expression of exogenous DNA sequences. Exemplary reporter genes include β-glucuronidase and luciferase.

5.7.21.2. Transfer Regions of a Suitable Plasmid

As described above, the exogenous DNA sequences are introduced into the area of the explants by incubation with Agrobacterium cells which carry the sequences to be transferred within a transfer DNA (T-DNA) region found on a suitable plasmid, typically the Ti plasmid. Ti plasmids contain two regions essential for the transformation of plant cells. One of these, the T-DNA region, is transferred to the plant nuclei and induces tumor formation. The other, referred to as the virulence (vir) region, is essential for the transfer of the T-DNA but is not itself transferred. By inserting the DNA sequence to be transferred into the T-DNA region, introduction of the DNA sequences to the plant genome can be effected. Usually, the Ti plasmid will be modified to delete or to inactivate the tumor-causing genes so that it is suitable for use as a vector for the transfer of the gene constructs of the present invention. Other plasmids may be utilized in conjunction with Agrobacterium for transferring the DNA sequences of the present invention to the plant cells.

The construction of recombinant Ti plasmids may be accomplished using conventional recombinant DNA techniques, such as those described by Sambrook et al. (1989). Frequently, the plasmids will include additional selective marker genes which permit manipulation and construction of the plasmids in suitable hosts, typically bacterial hosts other than Agrobacterium, such as E. coli. In addition to the above-described kanamycin resistance marker gene, other exemplary genes are the tetracycline resistance gene and the ampicillin resistance gene, among others.

5.7.21.3 Confirming Transformation

After green transformed shoots are approximately ½″ tall, they can then be transplanted to soil within a greenhouse or elsewhere in a conventional manner for tissue culture plantlets. Transformation of the resulting plantlets can be confirmed by assaying activity for the selection marker, or by assaying the plant material for any of the phenotypes which have been introduced by the exogenous DNA. Suitable assay techniques include polymerase chain reaction (PCR), restriction enzyme digestion, Southern blot hybridization and Northern blot hybridization.

5.7.21.4. Commercial Production

The present invention represents a breakthrough in the commercial production of genetically transformed plants. Because the method uses petiole tissue from a grower's mother plant (a stock plant), the starting petiole explants have, by definition, a commercially desirable morphology to begin with. However, if the mother plant could be improved by genetic transformation of some type, for example to deactivate a gene which expresses an enzyme in the ethylene synthesis pathway, the progeny of the mother plant may thus be improved in this one way over their parent stock. The petiole tissue from the stock plant, plus the genetic transformation from the Agrobacterium, yield both an improved genetic makeup of the commercially produced plants (although with preserved desired morphology from the mother plant) and at the same time the high yields possible only with the generation of many plantlets in a single generation's growth in tissue culture. In summary, with the present method a single genetically transformed mother plant can yield literally thousands of offspring plants. No one in the prior art has attempted to combine these two previously disparate technologies to achieve a unique method in which the result is no less than a commercially viable technique for making genetically recombinant plants in commercially feasible numbers.

5.7.22 Transformation of Plant Cells Using Alternative Methods

A variety of techniques are available for the introduction of the genetic material into or transformation of the plant cell host. However, the particular manner of introduction of the plant vector into the host is not critical to the practice of the present invention, and any method which provides for efficient transformation may be employed. In addition to transformation using plant transformation vectors derived from the tumor-inducing (Ti) or root-inducing (Ri) plasmids of Agrobacterium, alternative methods could be used to insert the DNA constructs of the present invention into plant cells. Such methods may include, for example, the use of liposomes, transformation using viruses or pollen, chemicals that increase the direct uptake of DNA (Paszkowski et al., EMBO J., 3, 2717 (1984)), microinjection (Crossway et al., Mol. Gen. Genet., 202, 179 (1985)), electroporation (Fromm et al., Proc. Natl. Acad. Sci. USA, 82, 5824 (1985)), or high-velocity microprojectiles (Klein et al., Nature, 327, 70 (1987)).

5.7.22.1 Plant Tissue Source or Cultured Plant Cell

The choice of plant tissue source or cultured plant cells for transformation will depend on the nature of the host plant and the transformation protocol. Useful tissue sources include callus, suspension culture cells, protoplasts, leaf segments, stem segments, tassels, pollen, embryos, hypocotyls, tuber segments, meristematic regions, and the like.

The tissue source is regenerable, in that it will retain the ability to regenerate whole, fertile plants following transformation.

5.7.22.2 Conditions During Transformation

The transformation is carried out under conditions directed to the plant tissue of choice. The plant cells or tissue are exposed to the DNA carrying the present multi-gene expression cassette for an effective period of time. This may range from a less-than-one-second pulse of electricity for electroporation, to a two-to-three day co-cultivation in the presence of plasmid-bearing Agrobacterium cells. Buffers and media used will also vary with the plant tissue source and transformation protocol. Many transformation protocols employ a feeder layer of suspended culture cells (tobacco or Black Mexican Sweet Corn, for example) on the surface of solid media plates, separated by a sterile filter paper disk from the plant cells or tissues being transformed.

Following treatment with DNA, the plant cells or tissue may be cultivated for varying lengths of time prior to selection, or may be immediately exposed to a selective agent such as those described hereinabove.

5.7.22.3 Inhibitory Agent

Protocols involving exposure to Agrobacterium will also include an agent inhibitory to the growth of the Agrobacterium cells. Commonly used compounds are antibiotics such as cefotaxime and carbenicillin. The media used in the selection may be formulated to maintain transformed callus or suspension culture cells in an undifferentiated state, or to allow production of shoots from callus, leaf or stem segments, tuber disks, and the like.

5.7.23 Method for Transformation of The Target Plant

The methods used for the actual transformation of the target plant are not critical to this invention. The transformation of the plant is preferably permanent, e.g. by integration of introduced sequences into the plant genome, so that the introduced sequences are passed onto successive plant generations. There are many plant transformation techniques well-known to workers in the art, and new techniques are continually becoming known. Any technique that is suitable for the target plant can be employed with this invention. For example, the sequences can be introduced in a variety of forms, such as a strand of DNA, in a plasmid, or in an artificial chromosome, to name a few. The introduction of the sequences into the target plant cells can be accomplished by a variety of techniques, as well, such as calcium phosphate-DNA co-precipitation, electroporation, microinjection, Agrobacterium infection, liposomes or microprojectile transformation. Those of ordinary skill in the art can refer to the literature for details, and select suitable techniques without undue experimentation.

5.7.23.1 Introduction of Sequences into Target Plant Cells

It is possible to introduce the recombinase gene, in particular, into the transgenic plant in a number of ways. The gene can be introduced along with all of the other basic sequences, as in the first preferred embodiment described above. The repressible promoter/recombinase construct can also be introduced directly via a viral vector into a transgenic plant that contains the other sequence components of the system. Still another method of introducing all the necessary sequences into a single plant is the second preferred embodiment described above, involving a first transgenic plant containing the transiently-active promoter/structural gene sequences and the blocking sequence, and a second transgenic plant containing the recombinase gene linked to a germination-specific plant-active promoter, the two plants being hybridized by conventional means to produce hybrid progeny containing all the necessary sequences.

It is also possible to introduce the recombinase itself directly into a transgenic plant as a conjugate with a compound, such as biotin, that is transported into the cell. See Horn, et al. (1990).

5.7.4.15.2 Direct or Vectored Transformation

Various methods for direct or vectored transformation of plant cells, e.g., plant protoplast cells, have been described, e.g., in above-cited PCT application WO 95/14099. As noted in that reference, promoters directing expression of selectable markers used for plant transformation (e.g., nptII) should operate effectively in plant hosts. One such promoter is the nos promoter from native Ti plasmids (Herrera-Estrella, et al., Nature 303: 209-213 (1983)). Others include the 35S and 19S promoters of cauliflower mosaic virus (Odell, et al., Nature 313: 810-812 (1985)) and the 2′ promoter (Velten, et al., EMBO J. 3: 2723-2730 (1984)).

In one preferred embodiment, the embryo and endosperm of mature seeds are removed to expose scutellum tissue cells. The cells may be transformed by DNA bombardment or injection, or by vectored transformation, e.g., by Agrobacterium infection after bombarding the scutellar cells with microparticles to make them susceptible to Agrobacterium infection (Bidney et al., Plant Mol. Biol. 18: 301-313, 1992).

One preferred transformation follows the methods detailed generally in Sivamani, E. et al., Plant Cell Reports 15: 465 (1996); Zhang, S., et al., Plant Cell Reports 15: 465 (1996); and Li, L., et al., Plant Cell Reports 12: 250 (1993).

5.7.24 Subculturing Cells or Callus Growing in Normally Inhibitory Concentrations of the Selective Agents

Cells or callus observed to be growing in the presence of normally inhibitory concentrations of the selective agents are presumed to be transformed and may be subcultured several additional times on the same medium to remove non-resistant sections. The cells or calli can then be assayed for the presence of the viral gene cassette, or may be subjected to known plant regeneration protocols. In protocols involving the direct production of shoots, those shoots appearing on the selective media are presumed to be transformed and may be excised and rooted, either on selective medium suitable for the production of roots, or by simply dipping the excised shoot in a root-inducing compound and directly planting it in vermiculite.

5.7.25 Selecting for Multi-Viral Resistance

In order to produce transgenic plants exhibiting multi-viral resistance, the viral genes must be taken up into the plant cell and stably integrated within the plant genome. Plant cells and tissues selected for their resistance to an inhibitory agent are presumed to have acquired the selectable marker gene encoding this resistance during the transformation treatment.

Since the marker gene is commonly linked to the viral genes, it can be assumed that the viral genes have similarly been acquired. Southern blot hybridization analysis using a probe specific to the viral genes can then be used to confirm that the foreign genes have been taken up and integrated into the genome of the plant cell. This technique may also give some indication of the number of copies of the gene that have been incorporated. Successful transcription of the foreign gene into mRNA can likewise be assayed using Northern blot hybridization analysis of total cellular RNA and/or cellular RNA that has been enriched in a polyadenylated region. mRNA molecules encompassed within the scope of the invention are those which contain viral specific sequences derived from the viral genes present in the transformed vector which are of the same polarity to that of the viral genomic RNA such that they are capable of base pairing with viral specific RNA of the opposite polarity to that of viral genomic RNA under conditions described in Chapter 7 of Sambrook et al. (1989). mRNA molecules also encompassed within the scope of the invention are those which contain viral specific sequences derived from the viral genes present in the transformed vector which are of the opposite polarity to that of the viral genomic RNA such that they are capable of base pairing with viral genomic RNA under conditions described in Chapter 7 of Sambrook et al. (1989).

The presence of a viral gene can also be detected by immunological assays, such as the double-antibody sandwich assays described by Namba, et al., Gene, 107, 181 (1991) as modified by Clark et al., J. Gen. Virol., 34, 475 (1979). See also, Namba et al., Phytopathology, 82, 940 (1992).

Virus resistance can be assayed via infectivity studies as generally disclosed by Namba et al., ibid., wherein plants are scored as symptomatic when any inoculated leaf shows vein clearing, mosaic or necrotic symptoms.

It is understood that the invention is operable when either sense or anti-sense viral specific RNA is transcribed from the expression cassettes described above. That is, there is no specific molecular mechanism attributed to the desired phenotype and/or genotype exhibited by the transgenic plants.

Thus, protection against viral challenge can occur by any one or any number of mechanisms.

It is also understood that virus resistance can occur by the expression of any virally encoded gene. Thus, transgenic plants expressing a coat protein gene or a non-coat protein gene can be resistant to challenge with a homologous or heterologous virus.

5.7.26 Cell Culture Production of Mature Heterologous Protein

Transgenic cells, typically callus cells, are cultured under conditions that favor plant cell growth, until the cells reach a desired cell density, then under conditions that favor expression of the mature protein under the control of the given promoter. Preferred culture conditions are described herein. Purification of the mature protein secreted into the medium is by standard techniques known by those of skill in the art.

In one embodiment of the invention, in which BPN′ is secreted as the proBPN′ form of the enzyme, the chaperon “pro” moiety of the enzyme facilitates enzyme folding and is cleaved from the enzyme, leaving the active mature form of BPN′. In another embodiment, the mature enzyme is co-expressed and co-secreted with the “pro” chaperon moiety, with conversion of the enzyme to active form occurring in presence of the free chaperon (Eder et al., Biochem. (1993) 32: 18-26; Eder et al., (1993) J. Mol. Biol. 223: 293-304). In yet another embodiment of the invention, the BPN′ is secreted in inactive form at a pH that may be in the 6-8 range, with subsequent activation of the inactive form, e.g., after enzyme isolation, by exposure to the “pro” chaperon moiety, e.g., immobilized to a solid support. In both of these embodiments, the culture medium is maintained at a pH of between 5 and 6, preferably about 5.5, during the period of active expression and secretion of BPN′, to keep the BPN′, which is normally active at alkaline pH, at a pH below optimal activity.

5.7.26.1 Production of Mature Heterologous Protein in Germinating Seeds

In this embodiment, monocot cells transformed as above are used to regenerate plants, seeds from the plants are harvested and then germinated, and the mature protein is isolated from the germinated seeds.

Plant regeneration from cultured protoplasts or callus tissue is carried out by standard methods, e.g., as described in Evans et al., HANDBOOK OF PLANT CELL CULTURE Vol. 1 (MacMillan Publishing Co., New York, 1983); and Vasil I. R. (ed.), CELL CULTURE AND SOMATIC CELL GENETICS OF PLANTS, Acad. Press, Orlando, Vol. I, 1984, and Vol. III, 1986, and as described in the above-cited PCT application.

To achieve maximum production of recombinant protein from malting, the malting procedure may be modified to accommodate de-hulled and de-embryonated seeds, as described in above-cited PCT application WO 95/14099. In the absence of sugars from the endosperm, there is expected to be a 5 to 10 fold increase in RAmy3D promoter activity and thus expression of heterologous protein. Alternatively, when embryoless half-seeds are incubated in 10 mM CaCl₂ and 5 μM gibberellic acid, there is a 50 fold increase in RAmy1A promoter activity.

5.7.27 Regeneration of the Transformed Plant Cells

After transformation, the transformed plant cells must be regenerated.

The methods used to regenerate transformed cells into whole plants are not critical to this invention, and any method suitable for the target plant can be employed. The literature describes numerous techniques for regenerating specific plant types (e.g., via somatic embryogenesis, Umbeck, et al., 1987), and more are continually becoming known. Those of ordinary skill in the art can refer to the literature for details and select suitable techniques without undue experimentation.

Plant regeneration from cultured protoplasts is described in Evans et al., Handbook of Plant Cell Cultures, Vol. 1 (MacMillan Publishing Co., New York, 1983); and Vasil I. R. (ed.), Cell Culture and Somatic Cell Genetics of Plants, Acad. Press, Orlando, Vol. I, 1984, and Vol. III, 1986, which are hereby incorporated by reference.

It is known that practically all plants can be regenerated from cultured cells or tissues, including but not limited to all major species of sugarcane, sugar beets, cotton, fruit trees, and legumes.

Means for regeneration vary from species to species of plants, but generally a suspension of transformed protoplasts or a petri plate containing explants is first provided. Callus tissue is formed and shoots may be induced from callus and subsequently rooted. Alternatively, embryo formation can be induced in the callus tissue. These embryos germinate as natural embryos to form plants. The culture media will generally contain various amino acids and hormones, such as auxin and cytokinins. It is also advantageous to add glutamic acid and proline to the medium, especially for such species as corn and alfalfa. Efficient regeneration will depend on the medium, on the genotype, and on the history of the culture. If these three variables are controlled, then regeneration is usually reproducible and repeatable.

5.7.28 Breeding Techniques

After the expression cassette is stably incorporated in transgenic plants, it can be transferred to other plants by sexual crossing. Any of a number of standard breeding techniques can be used, depending upon the species to be crossed.

Seed from plants regenerated from tissue culture is grown in the field and self-pollinated to generate true breeding plants. The progeny from these plants become true breeding lines which are evaluated for viral resistance in the field under a range of environmental conditions. The commercial value of viral-resistant plants is greatest if many different hybrid combinations with resistance are available for sale. The farmer typically grows more than one kind of hybrid based on such differences as maturity, disease and insect resistance, color or other agronomic traits. Additionally, hybrids adapted to one part of a country are not adapted to another part because of differences in such traits as maturity, disease and insect tolerance, or public demand for specific varieties in given geographic locations.
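To illustrate why repeated self-pollination yields true breeding plants, the expected genotype proportions at a single transgene locus can be tracked across selfing generations. The sketch below is a textbook Mendelian calculation rather than a procedure taken from this disclosure. It assumes a regenerated plant carrying one copy of the transgene at one locus (hemizygous), no selection, and equal viability of all genotypes; the function name is hypothetical.

```python
from fractions import Fraction


def self_generations(n):
    """Expected genotype proportions after n generations of selfing.

    Starts from a plant hemizygous for a single-locus transgene.
    Returns (homozygous transgenic, hemizygous, null) as exact fractions.
    """
    hom, hemi, null = Fraction(0), Fraction(1), Fraction(0)
    for _ in range(n):
        # Selfing a hemizygote gives 1/4 homozygous : 1/2 hemizygous : 1/4 null;
        # homozygous and null plants breed true.
        hom, hemi, null = hom + hemi / 4, hemi / 2, null + hemi / 4
    return hom, hemi, null


if __name__ == "__main__":
    for generation in range(1, 6):
        print(generation, self_generations(generation))
```

Under these assumptions the segregating (hemizygous) class halves with each generation, so the population converges on lines that no longer segregate. In practice breeders select resistant individuals at each generation, which this unselected expectation does not model.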

Because of this, it is necessary to breed viral resistance into a large number of parental lines so that many hybrid combinations can be produced.

Adding viral resistance to agronomically elite lines is most efficiently accomplished when the genetic control of viral resistance is understood. This requires crossing resistant and sensitive plants and studying the pattern of inheritance in segregating generations to ascertain whether the trait is expressed as dominant or recessive, the number of genes involved, and any possible interaction between genes if more than one is required for expression. With respect to transgenic plants of the type disclosed herein, the transgenes exhibit dominant, single gene Mendelian behavior. This genetic analysis can be part of the initial efforts to convert agronomically elite, yet sensitive lines to resistant lines. A conversion process (backcrossing) is carried out by crossing the original resistant line with a sensitive elite line and crossing the progeny back to the sensitive parent. The progeny from this cross will segregate such that some plants carry the resistance gene(s) whereas some do not. Plants carrying the resistance gene(s) will be crossed again to the sensitive parent resulting in progeny which segregate for resistance and sensitivity once more. This is repeated until the original sensitive parent has been converted to a resistant line, yet possesses all of the other important attributes originally found in the sensitive parent. A separate backcrossing program is implemented for every sensitive elite line that is to be converted to a virus resistant line.
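The arithmetic behind the backcrossing scheme described above can be illustrated with a short sketch. It is not part of the disclosed method. It assumes the standard genome-wide expectation that each backcross halves the remaining donor-derived genome, and it ignores linkage drag around the transgene and any marker-assisted selection. The function names are hypothetical and chosen only for illustration.

```python
from fractions import Fraction


def recurrent_parent_fraction(backcrosses: int) -> Fraction:
    """Expected genome fraction from the recurrent (sensitive elite) parent.

    backcrosses = 0 is the initial resistant x sensitive cross (F1);
    each later generation is a cross back to the sensitive parent.
    Assumption: unlinked loci, no selection other than for the transgene.
    """
    if backcrosses < 0:
        raise ValueError("backcrosses must be non-negative")
    return 1 - Fraction(1, 2) ** (backcrosses + 1)


def carrier_fraction() -> Fraction:
    """Expected share of progeny carrying a single dominant transgene.

    A hemizygous carrier crossed to the non-carrier recurrent parent
    segregates 1:1, so half of the progeny carry the gene.
    """
    return Fraction(1, 2)


def backcrosses_needed(target: Fraction) -> int:
    """Smallest number of backcrosses reaching the target recurrent fraction."""
    if not (0 <= target < 1):
        raise ValueError("target must be in [0, 1)")
    n = 0
    while recurrent_parent_fraction(n) < target:
        n += 1
    return n


if __name__ == "__main__":
    for n in range(7):
        print(n, float(recurrent_parent_fraction(n)))
    print("backcrosses for 99%:", backcrosses_needed(Fraction(99, 100)))
```

Under these assumptions the expected recurrent-parent fraction is 1/2 after the initial cross, 3/4 after one backcross and 7/8 after two, while about half of the progeny in each generation carry the resistance gene and are available for the next cross.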

Subsequent to the backcrossing, the new resistant lines and the appropriate combinations of lines which make good commercial hybrids are evaluated for viral resistance, as well as for a battery of important agronomic traits. Resistant lines and hybrids are produced which are true to type of the original sensitive lines and hybrids. This requires evaluation under a range of environmental conditions under which the lines or hybrids will be grown commercially. Parental lines of hybrids that perform satisfactorily are increased and utilized for hybrid production using standard hybrid production practices.

5.7.28.1 Use of Conventional Cultivation

Once transgenic plants of this type are produced, the plants themselves can be cultivated in accordance with conventional procedure so that the DNA construct is present in the resulting plants. Alternatively, transgenic seeds are recovered from the transgenic plants. These seeds can then be planted in the soil and cultivated using conventional procedures to produce transgenic plants.

5.7.28.2 Plant Varieties

The present invention can be used to make a variety of transgenic plants. The method is particularly suited for use with plants that are planted as a yearly crop from seed. These include, but are not limited to, fiber crops such as cotton and flax; dicotyledonous seed crops such as soybean, sunflower and peanut; annual ornamental flowers; monocotyledonous grain crops such as maize, wheat and sorghum; leaf crops such as tobacco; vegetable crops such as lettuce, carrot, broccoli, cabbage and cauliflower; and fruit crops such as tomato, zucchini, watermelon, cantaloupe and pumpkin.

The present invention can be utilized in conjunction with a wide variety of plants or their seeds.

Suitable plants include dicots and monocots. More particularly, useful crop plants can include: alfalfa, rice, wheat, barley, rye, cotton, sunflower, peanut, corn, potato, sweet potato, bean, pea, chicory, lettuce, endive, cabbage, Brussels sprout, beet, parsnip, turnip, cauliflower, broccoli, radish, spinach, onion, garlic, eggplant, pepper, celery, carrot, squash, pumpkin, zucchini, cucumber, apple, pear, melon, citrus, strawberry, grape, raspberry, pineapple, soybean, tobacco, tomato, sorghum, papaya, and sugarcane.

Examples of suitable ornamental plants are: Arabidopsis thaliana, Saintpaulia, petunia, pelargonium, poinsettia, chrysanthemum, carnation, and zinnia.

The plants used in the process of the present invention are derived from monocots, particularly the members of the taxonomic family known as the Gramineae. This family includes all members of the grass family of which the edible varieties are known as cereals. The cereals include a wide variety of species such as wheat (Triticum sps.), rice (Oryza sps.), barley (Hordeum sps.), oats (Avena sps.), rye (Secale sps.), corn (Zea sps.) and millet (Pennisetum sps.). In the present invention, preferred family members are rice and barley.

5.7.29 Identification and Localization and Introgression into Plants of Desired Multigenic Traits with RFLP Technology

The invention typically involves genetic linkage maps constructed with RFLP technology and the use of RFLP probes to correlate those probes with Quantitative Trait Loci (QTL) and the degree of inheritance of particular multigenic traits (For references see: PCT numbers WO 96/21031; WO 97/17429; WO 98/37223; WO 98/36085; U.S. Pat. Nos. 5,925,808 and 5,385,835).

1. A method for identifying proteins by differential labeling of peptides, the method comprising the following steps: (a) providing a sample comprising a polypeptide; (b) providing a plurality of labeling reagents which differ in molecular mass that can generate differentially labeled peptides that do not differ in chromatographic retention properties and do not differ in ionization and detection properties in mass spectrographic analysis, wherein the differences in molecular mass are distinguishable by mass spectrographic analysis; (c) fragmenting the polypeptide into peptide fragments by enzymatic digestion or by non-enzymatic fragmentation; (d) contacting the labeling reagents of step (b) with the peptide fragments of step (c), thereby labeling the peptides with the differential labeling reagents; (e) separating the peptides by chromatography to generate an eluate; (f) feeding the eluate of step (e) into a mass spectrometer and quantifying the amount of each peptide and generating the sequence of each peptide by use of the mass spectrometer; (g) inputting the sequence to a computer program product which compares the inputted sequence to a database of polypeptide sequences to identify the polypeptide from which the sequenced peptide originated.
2. The method of claim 1, wherein the sample of step (a) comprises a cell or a cell extract.
3. The method of claim 1, further comprising providing two or more samples comprising a polypeptide.
4. The method of claim 3, wherein one sample is derived from a wild type cell and one sample is derived from an abnormal or a modified cell.
5. The method of claim 4, wherein the abnormal cell is a cancer cell.
6. The method of claim 1, further comprising purifying or fractionating the polypeptide before the fragmenting of step (c).
7. The method of claim 1, further comprising purifying or fractionating the polypeptide before the labeling of step (d).
8. The method of claim 1, further comprising purifying or fractionating the labeled peptide before the chromatography of step (e).
9. The method of claim 6, claim 7 or claim 8, wherein the purifying or fractionating comprises a method selected from the group consisting of size exclusion chromatography, HPLC, reverse phase HPLC and affinity purification.
10. The method of claim 1, further comprising contacting the polypeptide with a labeling reagent of step (b) before the fragmenting of step (c).
11. The method of claim 1, wherein the labeling reagent of step (b) comprises the general formulae selected from the group consisting of: i. Z^(A)OH and Z^(B)OH, to esterify peptide C-terminals and/or Glu and Asp side chains; ii. Z^(A)NH₂ and Z^(B)NH₂, to form amide bond with peptide C-terminals and/or Glu and Asp side chains; and iii. Z^(A)CO₂H and Z^(B)CO₂H, to form amide bond with peptide N-terminals and/or Lys and Arg side chains; wherein Z^(A) and Z^(B) independently of one another comprise the general formula R-Z¹-A¹-Z²-A²-Z³-A³-Z⁴-A⁴-, Z¹, Z², Z³, and Z⁴ independently of one another, are selected from the group consisting of nothing, O, OC(O), OC(S), OC(O)O, OC(O)NR, OC(S)NR, OSiRR¹, S, SC(O), SC(S), SS, S(O), S(O₂), NR, NRR¹⁺, C(O), C(O)O, C(S), C(S)O, C(O)S, C(O)NR, C(S)NR, SiRR¹, (Si(RR¹)O)_(n), SnRR¹, Sn(RR¹)O, BR(OR¹), BRR¹, B(OR)(OR¹), OBR(OR¹), OBRR¹, and OB(OR)(OR¹), and R and R¹ is an alkyl group, A¹, A², A³, and A⁴ independently of one another, are selected from the group consisting of nothing or (CRR¹)_(n), wherein R, R¹, independently from other R and R¹ in Z¹ to Z⁴ and independently from other R and R¹ in A¹ to A⁴, are selected from the group consisting of a hydrogen atom, a halogen atom and an alkyl group; n in Z¹ to Z⁴, independent of n in A¹ to A⁴, is an integer having a value selected from the group consisting of 0 to about 51; 0 to about 41; 0 to about 31; 0 to about 21, 0 to about 11 and 0 to about 6.
12. The method of claim 11, wherein the alkyl group is selected from the group consisting of an alkenyl, an alkynyl and an aryl group.
13. The method of claim 11, wherein one or more C—C bonds from (CRR¹)_(n) are replaced with a double or a triple bond.
14. The method of claim 13, wherein an R or an R¹ group is deleted.
15. The method of claim 13, wherein (CRR¹)_(n) is selected from the group consisting of an o-arylene, an m-arylene and a p-arylene, wherein each group has none or up to 6 substituents.
16. The method of claim 13, wherein (CRR¹)_(n) is selected from the group consisting of a carbocyclic, a bicyclic and a tricyclic fragment, wherein the fragment has up to 8 atoms in the cycle with or without a heteroatom selected from the group consisting of an O atom, a N atom and an S atom.
17. The method of claim 1, wherein two or more labeling reagents have the same structure but a different isotope composition.
18. The method of claim 11, wherein Z^(A) has the same structure as Z^(B), but Z^(A) has a different isotope composition than Z^(B).
19. The method of claim 17, wherein the isotope is boron-10 and boron-11.
20. The method of claim 17, wherein the isotope is carbon-12 and carbon-13.
21. The method of claim 17, wherein the isotope is nitrogen-14 and nitrogen-15.
22. The method of claim 17, wherein the isotope is sulfur-32 and sulfur-34.
23. The method of claim 17, wherein, where the isotope with the lower mass is x and the isotope with the higher mass is y, and x and y are integers, x is greater than y.
24. The method of claim 17, wherein x and y are between 1 and about 11, between 1 and about 21, between 1 and about 31, between 1 and about 41, or between 1 and about 51.
25. The method of claim 1, wherein the labeling reagent of step (b) comprises the general formulae selected from the group consisting of: i. CD₃(CD₂)_(n)OH/CH₃(CH₂)_(n)OH, to esterify peptide C-terminals, where n=0, 1, 2 or y; ii. CD₃(CD₂)_(n)NH₂/CH₃(CH₂)_(n)NH₂, to form amide bond with peptide C-terminals, where n=0, 1, 2 or y; and iii. D(CD₂)_(n)CO₂H/H(CH₂)_(n)CO₂H, to form amide bond with peptide N-terminals, where n=0, 1, 2 or y; wherein D is a deuterium atom, and y is an integer selected from the group consisting of about 51; about 41; about 31; about 21, about 11; about 6 and between about 5 and 51.
26. The method of claim 1, wherein the labeling reagent of step (b) comprises the general formulae selected from the group consisting of: i. Z^(A)OH and Z^(B)OH to esterify peptide C-terminals; ii. Z^(A)NH₂/Z^(B)NH₂ to form an amide bond with peptide C-terminals; and iii. Z^(A)CO₂H/Z^(B)CO₂H to form an amide bond with peptide N-terminals; wherein Z^(A) and Z^(B) have the general formula R-Z¹-A¹-Z²-A²-Z³-A³-Z⁴-A⁴-, Z¹, Z², Z³, and Z⁴, independently of one another, are selected from the group consisting of nothing, O, OC(O), OC(S), OC(O)O, OC(O)NR, OC(S)NR, OSiRR¹, S, SC(O), SC(S), SS, S(O), S(O₂), NR, NRR¹⁺, C(O), C(O)O, C(S), C(S)O, C(O)S, C(O)NR, C(S)NR, SiRR¹, (Si(RR¹)O)_(n), SnRR¹, Sn(RR¹)O, BR(OR¹), BRR¹, B(OR)(OR¹), OBR(OR¹), OBRR¹, and OB(OR)(OR¹); A¹, A², A³, and A⁴, independently of one another, are selected from the group consisting of nothing and the general formulae (CRR¹)_(n), and, R and R¹ is an alkyl group.
27. The method of claim 26, wherein a single C—C bond in a (CRR¹)_(n) group is replaced with a double or a triple bond.
28. The method of claim 27, wherein R and R¹ are absent.
29. The method of claim 27, wherein (CRR¹)_(n) comprises a moiety selected from the group consisting of an o-arylene, an m-arylene and a p-arylene, wherein the group has none or up to 6 substituents.
30. The method of claim 27, wherein the group comprises a carbocyclic, a bicyclic, or a tricyclic fragment with up to 8 atoms in the cycle, with or without a heteroatom selected from the group consisting of an O atom, an N atom and an S atom.
31. The method of claim 26, wherein R, R¹, independently from other R and R¹ in Z¹-Z⁴ and independently from other R and R¹ in A¹-A⁴, are selected from the group consisting of a hydrogen atom, a halogen and an alkyl group.
32. The method of claim 31, wherein the alkyl group is selected from the group consisting of an alkenyl, an alkynyl and an aryl group.
33. The method of claim 26, wherein n in Z¹-Z⁴ is independent of n in A¹-A⁴ and is an integer selected from the group consisting of about 51; about 41; about 31; about 21, about 11 and about 6.
34. The method of claim 26, wherein Z^(A) has the same structure as Z^(B) but Z^(A) further comprises x number of —CH₂— fragment(s) in one or more A¹-A⁴ fragments, wherein x is an integer.
35. The method of claim 26, wherein Z^(A) has the same structure as Z^(B) but Z^(A) further comprises x number of —CF₂— fragment(s) in one or more A¹-A⁴ fragments, wherein x is an integer.
36. The method of claim 26, wherein Z^(A) comprises x number of protons and Z^(B) comprises y number of halogens in the place of protons, wherein x and y are integers.
37. The method of claim 26, wherein Z^(A) contains x number of protons and Z^(B) contains y number of halogens, and there are x−y number of protons remaining in one or more A¹-A⁴ fragments, wherein x and y are integers.
38. The method of claim 26, wherein Z^(A) further comprises x number of —O— fragment(s) in one or more A¹-A⁴ fragments, wherein x is an integer.
39. The method of claim 26, wherein Z^(A) further comprises x number of —S— fragment(s) in one or more A¹-A⁴ fragments, wherein x is an integer.
40. The method of claim 26, wherein Z^(A) further comprises x number of —O— fragment(s) and Z^(B) further comprises y number of —S— fragment(s) in the place of —O— fragment(s), wherein x and y are integers.
41. The method of claim 26, wherein Z^(A) further comprises x−y number of —O— fragment(s) in one or more A¹-A⁴ fragments, wherein x and y are integers.
42. The method of claim 37, claim 40 or claim 41, wherein x and y are integers selected from the group consisting of between 1 and about 51; between 1 and about 41; between 1 and about 31; between 1 and about 21, between 1 and about 11 and between 1 and about 6, wherein x is greater than y.
43. The method of claim 1, wherein the labeling reagent of step (b) comprises the general formulae selected from the group consisting of: i. CH₃(CH₂)_(n)OH/CH₃(CH₂)_(n+m)OH, to esterify peptide C-terminals, where n=0, 1, 2, . . . , y; m=1, 2, . . . , y; ii. CH₃(CH₂)_(n)NH₂/CH₃(CH₂)_(n+m)NH₂, to form amide bond with peptide C-terminals, where n=0, 1, 2, . . . , y; m=1, 2, . . . , y; and, iii. H(CH₂)_(n)CO₂H/H(CH₂)_(n+m)CO₂H, to form amide bond with peptide N-terminals, where n=0, 1, 2, . . . , y; m=1, 2, . . . , y; wherein n, m and y are integers.
44. The method of claim 43, wherein n, m and y are integers selected from the group consisting of about 51; about 41; about 31; about 21, about 11; about 6 and between about 5 and 51.
45. The method of claim 1, wherein the separating of step (e) comprises a liquid chromatography system.
46. The method of claim 1, wherein the liquid chromatography system comprises a multidimensional liquid chromatography.
47. The method of claim 1, wherein the mass spectrometer comprises a tandem mass spectrometry device.
48. The method of claim 1, further comprising quantifying the amount of each polypeptide.
49. The method of claim 1, further comprising quantifying the amount of each peptide.
50. A method for defining the expressed proteins associated with a given cellular state, the method comprising the following steps: (a) providing a sample comprising a cell in the desired cellular state; (b) providing a plurality of labeling reagents which differ in molecular mass but do not differ in chromatographic retention properties and do not differ in ionization and detection properties in mass spectrographic analysis, wherein the differences in molecular mass are distinguishable by mass spectrographic analysis; (c) fragmenting polypeptides derived from the cell into peptide fragments by enzymatic digestion or by non-enzymatic fragmentation; (d) contacting the labeling reagents of step (b) with the peptide fragments of step (c), thereby labeling the peptides with the differential labeling reagents; (e) separating the peptides by chromatography to generate an eluate; (f) feeding the eluate of step (e) into a mass spectrometer and quantifying the amount of each peptide and generating the sequence of each peptide by use of the mass spectrometer; (g) inputting the sequence to a computer program product which compares the inputted sequence to a database of polypeptide sequences to identify the polypeptide from which the sequenced peptide originated, thereby defining the expressed proteins associated with the cellular state.
51. A method for quantifying changes in protein expression between at least two cellular states, the method comprising the following steps: state; (b) providing a plurality of labeling reagents which differ in molecular mass but do not differ in chromatographic retention properties and do not differ in ionization and detection properties in mass spectrographic analysis, wherein the differences in molecular mass are distinguishable by mass spectrographic analysis; (c) fragmenting polypeptides derived from the cells into peptide fragments by enzymatic digestion or by non-enzymatic fragmentation; (d) contacting the labeling reagents of step (b) with the peptide fragments of step (c), thereby labeling the peptides with the differential labeling reagents, wherein the labels used in one sample are different from the labels used in other samples; (e) separating the peptides by chromatography to generate an eluate; (f) feeding the eluate of step (e) into a mass spectrometer and quantifying the amount of each peptide and generating the sequence of each peptide by use of the mass spectrometer; (g) inputting the sequence to a computer program product which identifies from which sample each peptide was derived, compares the inputted sequence to a database of polypeptide sequences to identify the polypeptide from which the sequenced peptide originated, and compares the amount of each polypeptide in each sample, thereby quantifying changes in protein expression between at least two cellular states.
52. A method for identifying proteins by differential labeling of peptides, the method comprising the following steps: (a) providing a sample comprising a polypeptide; (b) providing a plurality of labeling reagents which differ in molecular mass but do not differ in chromatographic retention properties and do not differ in ionization and detection properties in mass spectrographic analysis, wherein the differences in molecular mass are distinguishable by mass spectrographic analysis; (c) fragmenting the polypeptide into peptide fragments by enzymatic digestion or by non-enzymatic fragmentation; (d) contacting the labeling reagents of step (b) with the peptide fragments of step (c), thereby labeling the peptides with the differential labeling reagents; (e) separating the peptides by multidimensional liquid chromatography to generate an eluate; (f) feeding the eluate of step (e) into a tandem mass spectrometer and quantifying the amount of each peptide and generating the sequence of each peptide by use of the mass spectrometer; (g) inputting the sequence to a computer program product which compares the inputted sequence to a database of polypeptide sequences to identify the polypeptide from which the sequenced peptide originated.
53. A chimeric labeling reagent comprising (a) a first domain comprising a biotin; and (b) a second domain comprising a reactive group capable of covalently binding to an amino acid, wherein the chimeric labeling reagent comprises at least one isotope.
54. The chimeric labeling reagent of claim 53, wherein the isotope is in the first domain.
55. The chimeric labeling reagent of claim 54, wherein the isotope is in the biotin.
56. The chimeric labeling reagent of claim 53, wherein the isotope is in the second domain.
57. The chimeric labeling reagent of claim 53, wherein the isotope is selected from the group consisting of a deuterium isotope, a boron-10 or boron-11 isotope, a carbon-12 or a carbon-13 isotope, a nitrogen-14 or a nitrogen-15 isotope and a sulfur-32 or a sulfur-34 isotope.
58. The chimeric labeling reagent of claim 53 comprising two or more isotopes.
59. The chimeric labeling reagent of claim 53, wherein the reactive group capable of covalently binding to an amino acid is selected from the group consisting of a succinimide group, an isothiocyanate group and an isocyanate group.
60. The chimeric labeling reagent of claim 53, wherein the reactive group capable of covalently binding to an amino acid binds to a lysine or a cysteine.
61. The chimeric labeling reagent of claim 53, further comprising a linker moiety linking the biotin group and the reactive group.
62. The chimeric labeling reagent of claim 53, wherein the linker moiety comprises at least one isotope.
63. The chimeric labeling reagent of claim 53, wherein the linker is a cleavable moiety.
64. The chimeric labeling reagent of claim 53, wherein the linker can be cleaved by enzymatic digest.
65. The chimeric labeling reagent of claim 53, wherein the linker can be cleaved by reduction.
66. A method of comparing relative protein concentrations in a sample comprising (a) providing a plurality of differential small molecule tags, wherein the small molecule tags are structurally identical but differ in their isotope composition, and the small molecules comprise reactive groups that covalently bind to cysteine or lysine residues or both; (b) providing at least two samples comprising polypeptides; (c) attaching covalently the differential small molecule tags to amino acids of the polypeptides; (d) determining the protein concentrations of each sample in a tandem mass spectrometer; and, (e) comparing relative protein concentrations of each sample.
67. The method of claim 66, wherein the sample comprises a complete or a fractionated cellular sample.
68. The method of claim 66, wherein differential small molecule tags comprise a chimeric labeling reagent comprising (a) a first domain comprising a biotin; and, (b) a second domain comprising a reactive group capable of covalently binding to an amino acid, wherein the chimeric labeling reagent comprises at least one isotope.
69. The method of claim 68, wherein the isotope is selected from the group consisting of a deuterium isotope, a boron-10 or boron-11 isotope, a carbon-12 or a carbon-13 isotope, a nitrogen-14 or a nitrogen-15 isotope and a sulfur-32 or a sulfur-34 isotope.
70. The method of claim 68, wherein the chimeric labeling reagent comprises two or more isotopes.
71. The method of claim 68, wherein the reactive group capable of covalently binding to an amino acid is selected from the group consisting of a succinimide group, an isothiocyanate group and an isocyanate group.
72. A method of comparing relative protein concentrations in a sample comprising (a) providing a plurality of differential small molecule tags, wherein the differential small molecule tags comprise a chimeric labeling reagent comprising (i) a first domain comprising a biotin; and, (ii) a second domain comprising a reactive group capable of covalently binding to an amino acid, wherein the chimeric labeling reagent comprises at least one isotope; (b) providing at least two samples comprising polypeptides; (c) attaching covalently the differential small molecule tags to amino acids of the polypeptides; (d) isolating the tagged polypeptides on a biotin-binding column by binding tagged polypeptides to the column, washing non-bound materials off the column, and eluting tagged polypeptides off the column; (e) determining the protein concentrations of each sample in a tandem mass spectrometer; and, (f) comparing relative protein concentrations of each sample.
73. A method of producing an improved organism having a desirable trait comprising: a) obtaining an initial population of organisms, b) generating a set of mutagenized organisms, such that when all the genetic mutations in the set of mutagenized organisms are taken as a whole, there is represented a set of substantial genetic mutations, and c) detecting the presence of said improved organism.
74. The method of claim 73, wherein the set of substantial genetic mutations in step b) is comprised of a knocking out of at least 15 different genes.
75. The method of claim 73, wherein the set of substantial genetic mutations in step b) is comprised of a knocking out of at least 50 different genes.
76. The method of claim 73, wherein the set of substantial genetic mutations in step b) is comprised of a knocking out of at least 100 different genes.
77. The method of claim 73, wherein the set of substantial genetic mutations in step b) is comprised of an introduction of at least 15 different genes.
78. The method of claim 73, wherein the set of substantial genetic mutations in step b) is comprised of an introduction of at least 50 different genes.
79. The method of claim 73, wherein the set of substantial genetic mutations in step b) is comprised of an introduction of at least 100 different genes.
80. The method of claim 73, wherein the set of substantial genetic mutations in step b) is comprised of an alteration in the expression of at least 15 different genes.
81. The method of claim 73, wherein the set of substantial genetic mutations in step b) is comprised of an alteration in the expression of at least 50 different genes.
82. The method of claim 73, wherein the set of substantial genetic mutations in step b) is comprised of an alteration in the expression of at least 100 different genes.
83. A method of producing an improved organism having a desirable trait comprising: a) obtaining an initial population of organisms, b) generating a set of mutagenized organisms each having at least one genetic mutation, such that when all the genetic mutations in the set of mutagenized organisms are taken as a whole, there is represented a set of substantial genetic mutations, c) detecting the manifestation of at least two genetic mutations, d) introducing at least two detected genetic mutations into one organism, and e) optionally repeating any of steps a), b), c), and d).
84. The method of claim 83, wherein step d) is comprised of a knocking out of at least 15 different genes in one organism.
85. The method of claim 83, wherein step d) is comprised of a knocking out of at least 50 different genes in one organism.
86. The method of claim 83, wherein step d) is comprised of a knocking out of at least 100 different genes in one organism.
87. The method of claim 83, wherein step d) is comprised of an introduction of at least 15 different genes into one organism.
88. The method of claim 83, wherein step d) is comprised of an introduction of at least 50 different genes into one organism.
89. The method of claim 83, wherein step d) is comprised of an introduction of at least 100 different genes into one organism.
90. The method of claim 83, wherein step d) is comprised of an alteration in the expression of at least 15 different genes in one organism.
91. The method of claim 83, wherein step d) is comprised of an alteration in the expression of at least 50 different genes in one organism.
92. The method of claim 83, wherein step d) is comprised of an alteration in the expression of at least 100 different genes in one organism.
93. A method for identifying a gene that alters a trait of an organism, comprising: a) obtaining an initial population of organisms, b) generating a set of mutagenized organisms, such that when all the genetic mutations in the set of mutagenized organisms are taken as a whole, there is represented a set of substantial genetic mutations, and c) detecting the presence of an organism having said altered trait, and d) determining the nucleotide sequence of a gene that has been mutagenized in the organism having the altered trait.
94. A method for producing an organism with an improved trait, comprising: a) functionally knocking out an endogenous gene in a substantially clonal population of organisms; b) transferring a library of altered genes into the substantially clonal population of organisms, wherein each altered gene differs from the endogenous gene at only one codon; c) detecting a mutagenized organism having an improved trait; and d) determining the nucleotide sequence of a gene that has been transferred into the detected organism.
95. A method of introducing differentially activatable stacked traits into a transgenic cell or organism, which method is comprised of the following steps: a) obtaining an initial cell or organism; b) introducing into the working cell or organism a plurality of traits (stacked traits), including selectively and differentially activatable traits, whereby serviceable traits for this purpose include traits conferred by genes and traits conferred by gene pathways; c) analyzing the information obtained from steps a) and b), and d) optionally repeating any number or all of the steps of a), b), c), and d).
96. The method of claim 95, wherein step a) also includes holistic monitoring of the strain or organism whereby holistic monitoring can include the detection and/or measurement of all detectable functions and physical parameters (such as but not limited to morphology, behavior, growth, responsiveness to stimuli [e.g., antibiotics, different environment, etc.], and profiles of all detectable molecules, including molecules that are chemically at least in part nucleic acids, proteins, carbohydrates, proteoglycans, glycoproteins, or lipids).
97. The method of claim 95, wherein step d) also includes holistic monitoring of the strain or organism whereby holistic monitoring can include the detection and/or measurement of all detectable functions and physical parameters (such as but not limited to morphology, behavior, growth, responsiveness to stimuli [e.g., antibiotics, different environment, etc.], and profiles of all detectable molecules, including molecules that are chemically at least in part nucleic acids, proteins, carbohydrates, proteoglycans, glycoproteins, or lipids).
98. The method of claim 95, wherein steps a) and d) include holistic monitoring of the strain or organism whereby holistic monitoring can include the detection and/or measurement of all detectable functions and physical parameters (such as but not limited to morphology, behavior, growth, responsiveness to stimuli [e.g., antibiotics, different environment, etc.], and profiles of all detectable molecules, including molecules that are chemically at least in part nucleic acids, proteins, carbohydrates, proteoglycans, glycoproteins, or lipids).
99. The method of claim 95, wherein step b) includes the introduction of at least 15 stacked traits.
100. The method of claim 95, wherein step b) includes the introduction of at least 50 stacked traits.
101. The method of claim 95, wherein step b) includes the introduction of at least 100 stacked traits.
 102. The method of claim 96, wherein step a) includes screening cellular characteristics by utilizing one or any combination of the following methods: a) genomics; b) transcriptome characterization or RNA profiling; c) proteomics; d) metabolomics or the analysis of metabolites; e) lipidomics or lipid profiling.
 103. A method of claim 102, wherein proteomics specifically includes the use of amino acid reactive tags.
 104. A method of claim 97, wherein step d) includes screening cellular characteristics by utilizing one or any combination of the following methods: f) genomics; g) transcriptome characterization or RNA profiling; h) proteomics; i) metabolomics or the analysis of metabolites; j) lipidomics or lipid profiling.
 105. A method of claim 104, wherein proteomics specifically includes the use of amino acid reactive tags.
 106. A method of claim 98, wherein steps a) and d) include screening cellular characteristics by utilizing one or any combination of the following methods: k) genomics; l) transcriptome characterization or RNA profiling; m) proteomics; n) metabolomics or the analysis of metabolites; o) lipidomics or lipid profiling.
 107. A method of claim 106, wherein proteomics specifically includes the use of amino acid reactive tags.
 108. A method of claim 73, wherein step c) includes screening cellular characteristics by utilizing one or any combination of the following methods: q) genomics; r) transcriptome characterization or RNA profiling; s) proteomics; t) metabolomics or the analysis of metabolites; u) lipidomics or lipid profiling.
 109. A method of claim 108, wherein proteomics specifically includes the use of amino acid reactive tags.
 110. A method of claim 93, wherein step c) includes screening cellular characteristics by utilizing one or any combination of the following methods: v) genomics; w) transcriptome characterization or RNA profiling; x) proteomics; y) metabolomics or the analysis of metabolites; z) lipidomics or lipid profiling.
 111. A method of claim 110, wherein proteomics specifically includes the use of amino acid reactive tags.
 112. A method of claim 94, wherein step c) includes screening cellular characteristics by utilizing one or any combination of the following methods: aa) genomics; bb) transcriptome characterization or RNA profiling; cc) proteomics; dd) metabolomics or the analysis of metabolites; ee) lipidomics or lipid profiling.
 113. A method of claim 112, wherein proteomics specifically includes the use of amino acid reactive tags.
 114. A method for whole cell engineering of new or modified phenotypes by using real-time metabolic flux analysis, the method comprising the following steps: (a) making a modified cell by modifying the genetic composition of a cell; (b) culturing the modified cell to generate a plurality of modified cells; (c) measuring at least one metabolic parameter of the cell by monitoring the cell culture of step (b) in real time; and, (d) analyzing the data of step (c) to determine if the measured parameter differs from a comparable measurement in an unmodified cell under similar conditions, thereby identifying an engineered phenotype in the cell using real-time metabolic flux analysis.
 115. The method of claim 114, wherein the genetic composition of the cell is modified by a method comprising addition of a nucleic acid to the cell.
 116. The method of claim 115, wherein the nucleic acid comprises a nucleic acid heterologous to the cell.
 117. The method of claim 115, wherein the nucleic acid comprises a nucleic acid homologous to the cell.
 118. The method of claim 117, wherein the homologous nucleic acid comprises a modified homologous nucleic acid.
 119. The method of claim 118, wherein the homologous nucleic acid comprises a modified homologous gene.
 120. The method of claim 114, wherein the genetic composition of the cell is modified by a method comprising deletion of a sequence or modification of a sequence in the cell.
 121. The method of claim 114, wherein the genetic composition of the cell is modified by a method comprising modifying or knocking out the expression of a gene.
 122. The method of claim 114, further comprising selecting a cell comprising a newly engineered phenotype.
 123. The method of claim 122, further comprising culturing the selected cell, thereby generating a new cell strain comprising a newly engineered phenotype.
 124. The method of claim 122, wherein the newly engineered phenotype is selected from the group consisting of an increased or decreased expression or amount of a polypeptide, an increased or decreased amount of an mRNA transcript, an increased or decreased expression of a gene, an increased or decreased resistance or sensitivity to a toxin, an increased or decreased use or production of a metabolite, an increased or decreased uptake of a compound by the cell, an increased or decreased rate of metabolism, and an increased or decreased growth rate.
 125. The method of claim 114, further comprising isolating a cell comprising a newly engineered phenotype.
 126. The method of claim 114, wherein the newly engineered phenotype is a stable phenotype.
 127. The method of claim 126, wherein modifying the genetic composition of a cell comprises insertion of a construct into the cell, wherein the construct comprises a nucleic acid operably linked to a constitutively active promoter.
 128. The method of claim 114, wherein the newly engineered phenotype is an inducible phenotype.
 129. The method of claim 128, wherein modifying the genetic composition of a cell comprises insertion of a construct into the cell, wherein the construct comprises a nucleic acid operably linked to an inducible promoter.
 130. The method of claim 115, wherein the nucleic acid added to the cell in step (a) is stably inserted into the genome of the cell.
 131. The method of claim 115, wherein the nucleic acid added to the cell in step (a) propagates as an episome in the cell.
 132. The method of claim 115, wherein the nucleic acid added to the cell in step (a) encodes a polypeptide.
 133. The method of claim 132, wherein the polypeptide comprises a modified homologous polypeptide.
 134. The method of claim 132, wherein the polypeptide comprises a heterologous polypeptide.
 135. The method of claim 115, wherein the nucleic acid added to the cell in step (a) encodes a transcript comprising a sequence that is antisense to a homologous transcript.
 136. The method of claim 114, wherein modifying the genetic composition of the cell in step (a) comprises increasing or decreasing the expression of an mRNA transcript.
 137. The method of claim 114, wherein modifying the genetic composition of the cell in step (a) comprises increasing or decreasing the expression of a polypeptide.
 138. The method of claim 114, wherein modifying the homologous gene in step (a) comprises knocking out expression of the homologous gene.
 139. The method of claim 114, wherein modifying the homologous gene in step (a) comprises increasing the expression of the homologous gene.
 140. The method of claim 114, wherein the heterologous gene in step (a) comprises a sequence-modified homologous gene, wherein the sequence modification is made by a method comprising the following steps: (a) providing a template polynucleotide, wherein the template polynucleotide comprises a homologous gene of the cell; (b) providing a plurality of oligonucleotides, wherein each oligonucleotide comprises a sequence homologous to the template polynucleotide, thereby targeting a specific sequence of the template polynucleotide, and a sequence that is a variant of the homologous gene; (c) generating progeny polynucleotides comprising non-stochastic sequence variations by replicating the template polynucleotide of step (a) with the oligonucleotides of step (b), thereby generating polynucleotides comprising homologous gene sequence variations.
 141. The method of claim 114, wherein the heterologous gene in step (a) comprises a sequence-modified homologous gene, wherein the sequence modification is made by a method comprising the following steps: (a) providing a template polynucleotide, wherein the template polynucleotide comprises sequence encoding a homologous gene; (b) providing a plurality of building block polynucleotides, wherein the building block polynucleotides are designed to cross-over reassemble with the template polynucleotide at a predetermined sequence, and a building block polynucleotide comprises a sequence that is a variant of the homologous gene and a sequence homologous to the template polynucleotide flanking the variant sequence; (c) combining a building block polynucleotide with a template polynucleotide such that the building block polynucleotide cross-over reassembles with the template polynucleotide to generate polynucleotides comprising homologous gene sequence variations.
 142. The method of claim 114, wherein the cell is a prokaryotic cell.
 143. The method of claim 142, wherein the prokaryotic cell is a bacterial cell.
 144. The method of claim 114, wherein the cell is selected from the group consisting of a fungal cell, a yeast cell, a plant cell and an insect cell.
 145. The method of claim 114, wherein the cell is a eukaryotic cell.
 146. The method of claim 145, wherein the cell is a mammalian cell.
 147. The method of claim 146, wherein the mammalian cell is a human cell.
 148. The method of claim 114, wherein the measured metabolic parameter comprises rate of cell growth.
 149. The method of claim 148, wherein the rate of cell growth is measured by a change in optical density of the culture.
 150. The method of claim 114, wherein the measured metabolic parameter comprises a change in the expression of a polypeptide.
 151. The method of claim 150, wherein the change in the expression of the polypeptide is measured by a method selected from the group consisting of a one-dimensional gel electrophoresis, a two-dimensional gel electrophoresis, a tandem mass spectrometry, an RIA, an ELISA, an immunoprecipitation and a Western blot.
 152. The method of claim 114, wherein the measured metabolic parameter comprises a change in expression of at least one transcript, or, the expression of a transcript of a newly introduced gene.
 153. The method of claim 152, wherein the change in expression of the transcript is measured by a method selected from the group consisting of a hybridization, a quantitative amplification and a Northern blot.
 154. The method of claim 153, wherein transcript expression is measured by hybridization of a sample comprising transcripts of a cell, or nucleic acid representative of or complementary to transcripts of a cell, to immobilized nucleic acids on an array.
 155. The method of claim 114, wherein the measured metabolic parameter comprises an increase or a decrease in a secondary metabolite.
 156. The method of claim 155, wherein the secondary metabolite is selected from the group consisting of a glycerol and a methanol.
 157. The method of claim 114, wherein the measured metabolic parameter comprises an increase or a decrease in an organic acid.
 158. The method of claim 157, wherein the organic acid is selected from the group consisting of an acetate, a butyrate, a succinate and an oxaloacetate.
 159. The method of claim 114, wherein the measured metabolic parameter comprises an increase or a decrease in intracellular pH.
 160. The method of claim 159, wherein the increase or decrease in intracellular pH is measured by intracellular application of a dye, and the change in fluorescence of the dye is measured over time.
 161. The method of claim 114, wherein the measured metabolic parameter comprises an increase or a decrease in synthesis of DNA over time.
 162. The method of claim 161, wherein the increase or decrease in synthesis of DNA over time is measured by intracellular application of a dye, and the change in fluorescence of the dye is measured over time.
 163. The method of claim 114, wherein the measured metabolic parameter comprises an increase or a decrease in uptake of a composition.
 164. The method of claim 163, wherein the composition is a metabolite.
 165. The method of claim 164, wherein the metabolite is selected from the group consisting of a monosaccharide, a disaccharide, a polysaccharide, a lipid, a nucleic acid, an amino acid and a polypeptide.
 166. The method of claim 165, wherein the monosaccharide, disaccharide or polysaccharide comprises a glucose or a sucrose.
 167. The method of claim 163, wherein the composition is selected from the group consisting of an antibiotic, a metal, a steroid and an antibody.
 168. The method of claim 114, wherein the measured metabolic parameter comprises an increase or a decrease in the secretion of a byproduct or a secreted composition of a cell.
 169. The method of claim 168, wherein the byproduct or secreted composition is selected from the group consisting of a toxin, a lymphokine, a polysaccharide, a lipid, a nucleic acid, an amino acid, a polypeptide and an antibody.
 170. The method of claim 114, wherein the real time monitoring simultaneously measures a plurality of metabolic parameters.
 171. The method of claim 170, wherein real time monitoring of a plurality of metabolic parameters comprises use of a Cell Growth Monitor device.
 172. The method of claim 171, wherein the Cell Growth Monitor device is a Wedgewood Technology, Inc., Cell Growth Monitor model 652.
 173. The method of claim 171, wherein the real time simultaneous monitoring measures uptake of substrates, levels of intracellular organic acids and levels of intracellular amino acids.
 174. The method of claim 171, wherein the real time simultaneous monitoring measures: uptake of glucose; levels of acetate, butyrate, succinate or oxaloacetate; and, levels of intracellular natural amino acids.
 175. The method of claim 171, further comprising use of a computer-implemented program to monitor in real time the change in measured metabolic parameters over time.
 176. The method of claim 175, wherein the computer-implemented program comprises a computer-implemented method as set forth in FIG. 28.
 177. The method of claim 176, wherein the computer-implemented method comprises metabolic network equations.
 178. The method of claim 176, wherein the computer-implemented method comprises a pathway analysis.
 179. The method of claim 176, wherein the computer-implemented program comprises a preprocessing unit to filter out errors in the measurement before the metabolic flux analysis.
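
By way of illustration only, and without limiting any claim, the comparison of step (d) of claim 114, the optical-density growth measurement of claim 149, and the error-filtering preprocessing of claim 179 might be sketched in Python as follows. This is not the computer-implemented method of FIG. 28. The function names, the three-point median filter, the log-linear growth-rate fit, the 10% relative threshold, and all sample readings are assumptions introduced solely for this sketch.

```python
import math
from statistics import median


def median_filter(values, window=3):
    """Hypothetical preprocessing step: replace each interior reading with the
    median of its neighbourhood to suppress isolated spikes. Edge readings are
    left unchanged. `window` is assumed to be odd."""
    half = window // 2
    out = list(values)
    for i in range(half, len(values) - half):
        out[i] = median(values[i - half:i + half + 1])
    return out


def specific_growth_rate(times_h, od):
    """Least-squares slope of ln(OD) against time (1/h), assuming the culture
    is in exponential growth over the sampled interval."""
    logs = [math.log(v) for v in od]
    n = len(times_h)
    t_mean = sum(times_h) / n
    y_mean = sum(logs) / n
    num = sum((t - t_mean) * (y - y_mean) for t, y in zip(times_h, logs))
    den = sum((t - t_mean) ** 2 for t in times_h)
    return num / den


def differs(modified, reference, rel_tol=0.10):
    """Assumed decision rule: flag a difference when the modified-cell value
    deviates from the unmodified-cell value by more than rel_tol (relative)."""
    return abs(modified - reference) > rel_tol * abs(reference)


# Synthetic optical-density readings (illustrative values, not measured data).
times = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]
od_unmodified = [0.05 * math.exp(0.50 * t) for t in times]
od_modified = [0.05 * math.exp(0.70 * t) for t in times]

mu_ref = specific_growth_rate(times, median_filter(od_unmodified))
mu_mod = specific_growth_rate(times, median_filter(od_modified))

print("unmodified growth rate (1/h):", round(mu_ref, 3))
print("modified growth rate (1/h):", round(mu_mod, 3))
print("phenotype differs:", differs(mu_mod, mu_ref))

# The same filter applied to a flat trace with one spurious reading,
# for example a hypothetical intracellular pH series.
print(median_filter([7.0, 7.0, 9.5, 7.0, 7.0]))
```

A median filter is used here only because it removes isolated spikes with little effect on a smooth trace; on a steeply rising trace it can shift readings adjacent to a spike, so a practical preprocessing unit may call for a different error model.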